issue-#182 fixing code page conversion in RTF import

Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: issue-#182 fixing code page conversion in RTF import

Post by Josef Templ »

Here are some fixes that I found after extensive debugging with the help of log output.
There are several severe structural bugs in the RTF import.
In particular, the current font is not maintained properly in \plain and when terminating a group with }.
Groups form a stack with respect to the font selection (and other attributes), i.e. the current font must be restored
to the previous value when terminating a group. This explains the wrong usage of \f3 (Symbol) in the example.
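The stack behaviour described above can be sketched in a few lines. This is only an illustration in Python, not the HostTextConv code: the tokenizer, the function name fonts_per_run and the default font 0 are assumptions, and only groups and \fN are modelled (no \plain, no hex escapes).

```python
import re

# Hypothetical minimal tokenizer: control words, braces, and plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([{}])|([^\\{}]+)")

def fonts_per_run(rtf, default_font=0):
    """Return (font, text) pairs, restoring the font at each closing brace."""
    stack = []            # saved font numbers, one entry per open group
    font = default_font   # current font number
    runs = []
    for m in TOKEN.finditer(rtf):
        word, arg, brace, text = m.groups()
        if brace == "{":
            stack.append(font)   # entering a group saves the current font
        elif brace == "}":
            font = stack.pop()   # leaving a group restores the saved font
        elif word == "f" and arg is not None:
            font = int(arg)      # \fN selects font N inside the current group
        elif text:
            runs.append((font, text))
    return runs
```

With the input {\f0 abc{\f3 xyz}def}, the text after the inner group is attributed to font 0 again rather than to \f3, which is the behaviour the fix restores.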

Some additional improvements have been added:
+ additional Mac character sets defined for \fcharset; as specified in the standard
+ special cases 0, 1, 2 refined for \fcharset; as specified in the standard
+ code page conversion only if using a font with a code page different from 1252, which largely coincides with the Latin-1 subset of Unicode (the two differ only in the range 0x80 to 0x9F).
+ code page conversion in Write improved, writes ? in case of error and uses luowy's syntax.
+ ansiCodePage initialized to 1252
+ Write calls replaced by WriteUnicode for writing multibyte characters
+ command \cpg added for setting the code page of a font
+ type Context moved into ParseRichText because it references FontInfo now.
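As a rough illustration of how \fcharset, the code page and the "?" fallback fit together, here is a Python sketch. It is not the BlackBox implementation: the function name convert and the table excerpt are assumptions for illustration, and the 1252 case is modelled the way the list describes it (no conversion, bytes taken as Latin-1 code points).

```python
# Well-known \fcharset values and the Windows code pages they imply
# (a small excerpt, not the full table from the RTF specification).
CHARSET_TO_CODEPAGE = {
    0: 1252,    # ANSI
    128: 932,   # Shift-JIS
    134: 936,   # GB2312, Simplified Chinese
    161: 1253,  # Greek
    204: 1251,  # Russian
    238: 1250,  # Eastern European
}

def convert(raw, fcharset, ansi_code_page=1252):
    """Decode bytes from a font's code page; undecodable input yields '?'."""
    code_page = CHARSET_TO_CODEPAGE.get(fcharset, ansi_code_page)
    if code_page == 1252:
        # Assumption taken from the list above: no conversion for 1252,
        # each byte is used directly as a Latin-1 code point.
        return raw.decode("latin-1")
    try:
        return raw.decode("cp%d" % code_page)
    except UnicodeDecodeError:
        return "?"
```

For example, convert(b"\xe0", 204) decodes through cp1251, while an incomplete 2-byte sequence such as convert(b"\xc4", 134) falls back to "?".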

See the diffs at https://redmine.blackboxframework.org/p ... 11caeb2b71.

- Josef
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: issue-#182 fixing code page conversion in RTF import

Post by Josef Templ »

luowy wrote: For western fonts (one byte per character), the Write procedure is good enough for the translation,
but eastern fonts use multi-byte (2 or 4 byte) encodings, so translating one byte at a time will always be wrong (test the attachment file).
It needs more code.
luowy
luowy, if you know the encoding rules for code page 936 (Simplified Chinese) it would be easy to convert it.
The problem is that we need to know which characters are multi-byte, 2 or 4 or whatever, and how to detect them.

Obviously, any byte below 128 is a single-byte character, and bytes at or above 128 seem to introduce multi-byte sequences.
Is it always a 2-byte sequence or can it also be a 4-byte sequence?
I cannot find any hint in the RTF standard that allows one to discover multi-byte sequences without
additional knowledge of the character set or code page involved.

- Josef
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: issue-#182 fixing code page conversion in RTF import

Post by Josef Templ »

WinApi.GetCPInfo gives us the information about cp936.

2-byte encoding is now supported for simplified Chinese and the hello.rtf example works.
See diffs at https://redmine.blackboxframework.org/p ... 0a07eae27d.
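A sketch of how lead-byte information can drive the 2-byte handling, written in Python rather than Component Pascal. The table LEAD_BYTES and both function names are hypothetical, and the range 0x81..0xFE for cp936 is an assumption about what GetCPInfo reports, not something taken from the diffs.

```python
# Lead-byte ranges per code page, in the shape GetCPInfo reports them.
# The cp936 range 0x81..0xFE is an assumption stated for illustration.
LEAD_BYTES = {936: [(0x81, 0xFE)]}

def split_chars(raw, code_page):
    """Split raw bytes into 1-byte or 2-byte character units."""
    ranges = LEAD_BYTES.get(code_page, [])
    units, i = [], 0
    while i < len(raw):
        is_lead = any(lo <= raw[i] <= hi for lo, hi in ranges)
        # A lead byte takes the following byte with it, if there is one.
        n = 2 if is_lead and i + 1 < len(raw) else 1
        units.append(raw[i:i + n])
        i += n
    return units

def decode_units(raw, code_page):
    """Decode unit by unit, writing '?' for any unit that fails."""
    out = []
    for unit in split_chars(raw, code_page):
        try:
            out.append(unit.decode("cp%d" % code_page))
        except UnicodeDecodeError:
            out.append("?")
    return "".join(out)
```

The bytes C4 E3 BA C3 (as they would appear in RTF as \'c4\'e3\'ba\'c3) are split into two units and decode to two Chinese characters, while ASCII bytes pass through as single units.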

- Josef
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: issue-#182 fixing code page conversion in RTF import

Post by Josef Templ »

I think we can proceed with the voting about this issue.

- Josef
Zinn
Posts: 476
Joined: Tue Mar 25, 2014 5:56 pm
Location: Frankfurt am Main

Re: issue-#182 fixing code page conversion in RTF import

Post by Zinn »

Sorry, on my Ubuntu Wine system the sample Hello.rtf does not work.
The result: it displays four rectangles.
Opening Hello.rtf in LibreOffice Writer works.
- Helmut
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: issue-#182 fixing code page conversion in RTF import

Post by Josef Templ »

Zinn wrote: Sorry, on my Ubuntu Wine system the sample Hello.rtf does not work.
The result: it displays four rectangles.
Opening Hello.rtf in LibreOffice Writer works.
- Helmut
Helmut, can you "Copy&Paste" it from LibreOffice as RTF into BlackBox?
If this works, the font is OK, i.e. able to show Chinese glyphs.
If this does not work either, it may be the font.

It is also possible that WinApi.MultiByteToWideChar does not work properly
for 2-byte encodings under wine.

In either case, I have no idea how to fix this.

- Josef
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: issue-#182 fixing code page conversion in RTF import

Post by Josef Templ »

I just reactivated LibreOffice Writer on my Debian and tried the Copy&Paste.
It does not work. Also Paste Special as Unicode fails.
It seems to me that the wine fonts don't have the Chinese glyphs.

It is possible that there are special wine options/extensions available but
as far as I currently understand it there is nothing we can do in HostTextConv.

- Josef
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: issue-#182 fixing code page conversion in RTF import

Post by Josef Templ »

I also checked WinApi.MultiByteToWideChar under wine.
It returns the same results as under Windows. This means that the problem
must be somehow connected with the font.

For example, I can copy & paste the 4 boxes from BlackBox to LibreOffice Writer,
which displays them correctly!
When I copy them to WordPad, it does not display them correctly.
However, WordPad is able to open the file correctly.
There may be something like a font substitution going on.

- Josef
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: issue-#182 fixing code page conversion in RTF import

Post by Josef Templ »

Just discovered that when I (under Debian 9 with wine) select the font
named @Droid_Sans_Fallback then BlackBox displays the text correctly.
This is a font substitution done manually.

- Josef
Zinn
Posts: 476
Joined: Tue Mar 25, 2014 5:56 pm
Location: Frankfurt am Main

Re: issue-#182 fixing code page conversion in RTF import

Post by Zinn »

OK, it is a problem of the operating system in use and its fonts. The translation is correct and can be moved with cut and paste to another program.

One other topic: you changed the output to ? when a character can't be translated. In my opinion it is better to change it back so that the output equals the input ch, as it was before we started this issue. That way you may have a chance to correct the untranslated part. With a question mark, all information is lost.
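The two fallback policies can be compared with a small Python sketch. The function convert_byte and its keep_input flag are hypothetical names for illustration, not part of HostTextConv.

```python
def convert_byte(b, codec, keep_input=True):
    """Decode one byte; on failure keep the input value or write '?'."""
    try:
        return bytes([b]).decode(codec)
    except UnicodeDecodeError:
        # keep_input=True preserves the original byte value as a character,
        # so the untranslated part can still be repaired later.
        return chr(b) if keep_input else "?"
```

Byte 0x81 has no mapping in Python's cp1252 codec, so convert_byte(0x81, "cp1252") keeps the value 0x81, whereas keep_input=False collapses it (and every other failing byte) into the same "?".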

- Helmut