issue-#182 fixing code page conversion in RTF import

Josef Templ · Post by **Josef Templ** » Tue Nov 28, 2017 11:44 pm

Here are some fixes that I found after extensive debugging with the help of log output.
There are several severe structural bugs in the RTF import.
In particular, the current font is not maintained properly in \plain and when terminating a group with }.
Groups form a stack with respect to the font selection (and other attributes), i.e. the current font must be restored
to the previous value when terminating a group. This explains the wrong usage of \f3 (Symbol) in the example.
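
The group-stack behaviour described above can be sketched in a few lines. This is only an illustration in Python, not the code from HostTextConv: the function name `fonts_per_run` and the tiny tokenizer are hypothetical, it understands only `{`, `}`, `\fN` and plain text, and it ignores `\plain` because the post does not spell out how that command should affect the font.

```python
import re

# Control word with optional numeric argument | brace | run of plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([{}])|([^\\{}]+)")

def fonts_per_run(rtf, default_font=0):
    """Return (font_number, text) pairs; the font is restored at each '}'."""
    stack = []            # saved font numbers, one per open group
    font = default_font   # current font
    runs = []
    for word, arg, brace, text in TOKEN.findall(rtf):
        if brace == "{":
            stack.append(font)      # entering a group: remember the font
        elif brace == "}":
            font = stack.pop()      # leaving a group: restore the font
        elif word == "f" and arg:
            font = int(arg)         # \fN selects font N inside this group
        elif text:
            runs.append((font, text))
    return runs

print(fonts_per_run(r"{\f0 abc{\f3 xyz}def}"))
```

Without the `stack.pop()` on `}`, the text `def` would stay in font 3, which is the kind of stray Symbol font usage mentioned above.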

Some additional improvements have been added:
+ additional Mac character sets defined for \fcharset; as specified in the standard
+ special cases 0, 1, 2 refined for \fcharset; as specified in the standard
+ code page conversion only if the font's code page differs from 1252, which coincides with the Latin-1 range of Unicode except for bytes 0x80..0x9F.
+ code page conversion in Write improved, writes ? in case of error and uses luowy's syntax.
+ ansiCodePage initialized to 1252
+ Write calls replaced by WriteUnicode for writing multibyte characters
+ command \cpg added for setting the code page of a font
+ type Context moved into ParseRichText because it references FontInfo now.
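
How `\fcharset`, `\cpg` and the 1252 check might fit together is sketched below in Python. The table is only a subset of the `\fcharset` values listed in the RTF specification, and the function names, the precedence of `\cpg` over `\fcharset`, and the handling of charsets 1 and 2 are assumptions made for illustration; the actual changes are in the linked diffs.

```python
# Subset of the \fcharset -> code page table from the RTF specification.
CHARSET_TO_CODEPAGE = {
    0: 1252,    # ANSI
    77: 10000,  # Mac Roman
    128: 932,   # Shift-JIS
    134: 936,   # GB2312 (Simplified Chinese)
    136: 950,   # Big5
    161: 1253,  # Greek
    162: 1254,  # Turkish
    204: 1251,  # Russian
    238: 1250,  # Eastern European
}

def code_page_for(fcharset, cpg=None, ansi_code_page=1252):
    """Pick the code page of a font table entry (hypothetical helper)."""
    if cpg is not None:
        return cpg                  # assumption: an explicit \cpg wins
    if fcharset == 1:
        return ansi_code_page       # assumption: "default" follows the document
    if fcharset == 2:
        return None                 # assumption: Symbol font, no conversion
    return CHARSET_TO_CODEPAGE.get(fcharset, ansi_code_page)

def needs_conversion(code_page):
    """Convert only when the font's code page is known and is not 1252."""
    return code_page is not None and code_page != 1252

print(code_page_for(134), needs_conversion(code_page_for(134)))
print(code_page_for(0), needs_conversion(code_page_for(0)))
```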

See the diffs at https://redmine.blackboxframework.org/p ... 11caeb2b71.

- Josef

Josef Templ · Post by **Josef Templ** » Wed Nov 29, 2017 7:42 pm

luowy wrote: for western fonts (which use one byte per character), the Write procedure is good enough for the translation,
but eastern fonts use multi-byte (2 or 4 bytes) encodings, so translating one byte at a time will always be wrong (test the attached file);
it needs more code to handle this.
luowy

luowy, if you know the encoding rules for code page 936 (Simplified Chinese) it would be easy to convert it.
The problem is that we need to know which characters are multi-byte, 2 or 4 or whatever, and how to detect them.

Obviously, any character below 128 is a single byte character, and characters above seem to introduce multi-byte sequences.
Is it always a 2-byte sequence or can it also be a 4-byte sequence?
I cannot find any hint in the RTF standard that allows one to discover multi-byte sequences without
additional knowledge of the character set or code page involved.

- Josef

Josef Templ · Post by **Josef Templ** » Thu Nov 30, 2017 1:13 am

WinApi.GetCPInfo gives us the information about cp936.

2-byte encoding is now supported for simplified Chinese and the hello.rtf example works.
See diffs at https://redmine.blackboxframework.org/p ... 0a07eae27d.
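
The idea of using lead-byte information to pair up bytes can be sketched as follows. This is a Python illustration only, not the BlackBox code: the lead-byte range 0x81..0xFE is what GetCPInfo is assumed to report for code page 936, Python's `cp936` codec stands in for WinApi.MultiByteToWideChar, and the helper names are hypothetical.

```python
# Lead-byte ranges assumed for code page 936 (as GetCPInfo's LeadByte array
# would list them): any byte in 0x81..0xFE starts a 2-byte character.
LEAD_BYTE_RANGES_936 = [(0x81, 0xFE)]

def is_lead_byte(b, ranges):
    """True if byte value b opens a 2-byte sequence."""
    return any(lo <= b <= hi for lo, hi in ranges)

def decode_multibyte(data, codec="cp936", ranges=LEAD_BYTE_RANGES_936):
    """Decode bytes character by character, taking 2 bytes after a lead byte."""
    out = []
    i = 0
    while i < len(data):
        two = is_lead_byte(data[i], ranges) and i + 1 < len(data)
        width = 2 if two else 1
        out.append(data[i:i + width].decode(codec, errors="replace"))
        i += width
    return "".join(out)

# In an RTF file these bytes would appear as \'c4\'e3\'ba\'c3.
print(decode_multibyte(b"\xc4\xe3\xba\xc3"))
```

Bytes below 128 are never lead bytes here, so ASCII text passes through one byte at a time.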

- Josef

Josef Templ · Post by **Josef Templ** » Thu Nov 30, 2017 8:13 am

I think we can proceed with the voting about this issue.

- Josef

Zinn · Post by **Zinn** » Thu Nov 30, 2017 10:53 am

Sorry, on my Ubuntu Wine system the sample Hello.rtf does not work.
The result: it displays four rectangles.
Opening Hello.rtf in LibreOffice Writer works.
- Helmut

Josef Templ · Post by **Josef Templ** » Thu Nov 30, 2017 11:59 am

Zinn wrote: Sorry, on my Ubuntu Wine system the sample Hello.rtf does not work.
The result: it displays four rectangles.
Opening Hello.rtf in LibreOffice Writer works.
- Helmut

Helmut, can you "Copy&Paste" it from LibreOffice as RTF into BlackBox?
If this works, the font is OK, i.e. able to show Chinese glyphs.
If this does not work either, it may be the font.

It is also possible that WinApi.MultiByteToWideChar does not work properly
for 2-byte encodings under wine.

In either case, I have no idea how to fix this.

- Josef

Josef Templ · Post by **Josef Templ** » Thu Nov 30, 2017 1:16 pm

I just reactivated LibreOffice Writer on my Debian and tried the Copy&Paste.
It does not work. Also Paste Special as Unicode fails.
It seems to me that the wine fonts don't have the Chinese glyphs.

It is possible that there are special wine options/extensions available but
as far as I currently understand it there is nothing we can do in HostTextConv.

- Josef

Josef Templ · Post by **Josef Templ** » Thu Nov 30, 2017 1:54 pm

I also checked WinApi.MultiByteToWideChar under wine.
It returns the same results as under Windows. This means that the problem
must be somehow connected with the font.

For example, I can copy & paste the 4 boxes from BlackBox to LibreOffice Writer,
which displays them correctly!
When I copy it to Wordpad, it does not display it correctly.
However, Wordpad is able to open the file correctly.
There may be something like a font substitution going on.

- Josef

Josef Templ · Post by **Josef Templ** » Thu Nov 30, 2017 2:33 pm

Just discovered that when I (under Debian 9 with wine) select the font
named @Droid_Sans_Fallback then BlackBox displays the text correctly.
This is a font substitution done manually.

- Josef

Zinn · Post by **Zinn** » Fri Dec 01, 2017 7:05 am

OK, it is a problem of the operating system in use and its fonts. The translation is correct and the text can be moved with cut and paste to another program.

One other topic: you changed the output to ? when a character cannot be translated. In my opinion it is better to change it back so that the output equals the input ch, as it was before we started this issue. That way you still have a chance to correct the untranslated part. With a question mark all information is lost.
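
The two fallback policies under discussion can be compared with a small Python sketch. It is only an illustration: `convert_byte` and its `on_error` parameter are hypothetical names, and Python's `cp1251` codec (in which byte 0x98 is unassigned) stands in for the real conversion.

```python
def convert_byte(b, codec, on_error="question"):
    """Convert one byte; on failure write '?' or keep the input character."""
    try:
        return bytes([b]).decode(codec)
    except UnicodeDecodeError:
        if on_error == "question":
            return "?"     # current behaviour: the original value is lost
        return chr(b)      # proposed behaviour: pass the input ch through

print(repr(convert_byte(0xE0, "cp1251")))                   # convertible byte
print(repr(convert_byte(0x98, "cp1251")))                   # falls back to '?'
print(repr(convert_byte(0x98, "cp1251", on_error="keep")))  # keeps the byte value
```

With the second policy the unconverted byte value survives in the text, so it can still be corrected later.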

- Helmut
