issue-#9: adding module Characters → #19

Ivan Denisov · Post by **Ivan Denisov** » Sun Sep 07, 2014 2:43 pm

I tested Helmut's implementation of the new universal solution for Unicode identifiers. In general it all works very well. We should think about how to enter it in the issue tracker and add it to the development branch.

I have just started a new project from scratch in Redmine and a new repository on GitHub.

Zinn · Post by **Zinn** » Sun Sep 21, 2014 6:47 am

It will be difficult to add all those changes to Redmine, because a lot of modules are affected.
If the changes are split into several steps (groups), the intermediate results do not work with UTF-8.
Only with code pages does each step work correctly.

The groups could be
1. Kernel & Strings (75.)
2. Centralise some Unicode character functions (21.)
3. Letter Я correction (71.)
4. Insert Strings.ToShort call for SHORT (72. = 22.)
5. Insert Strings.ToLong call for LONG (73.)
6. Insert Strings.ToLong call before Kernel.SplitName (74.)
7. Change from CodePage to Utf8 (76.)

Steps 4, 5 and 6 are critical. They must all be working before you use UTF-8.

Josef Templ · Post by **Josef Templ** » Sun Sep 21, 2014 8:33 am

This is a complex issue.
Ideally, it should be discussed before even starting an issue.

The basic question is: do we want to change the symbol file format?
If YES, this allows completely different solutions than staying with the current
symbol file format.

Helmut's latest solution is based on changing the symbol file format (through the back door).
As far as I have tested it, the solution works, but it needs a lot of changes in a lot of modules.
With that number of changes there may always be hidden bugs; nobody knows.
The code page approach had less impact on the sources.

- Josef

Zinn · Post by **Zinn** » Sun Sep 21, 2014 9:19 am

Josef, I agree, it should be discussed before even starting.

The advantages of the code page version are:
- More fault-tolerant.
- Easier to debug.
- Downward compatible.

The disadvantages are:
- Code pages are obsolete.
- Windows will not have code pages in the future.

If you would like to go this way,
the number of modules to change is no different.
The steps are also the same.

Ivan Denisov · Post by **Ivan Denisov** » Sun Sep 21, 2014 4:50 pm

We should not use code pages.
Helmut's solution works well and it supports all the cases of Unicode identifiers.
The solution has a few bugs, but they are small (for example, the log message is wrong when modules are unloaded).
I think we need to include this solution as soon as possible. If it touches many files, that is not a problem from my point of view.

Bernhard · Post by **Bernhard** » Mon Sep 22, 2014 2:47 pm

After having tried to digest the discussion so far, I'd like to contribute my opinion:

If we really want to include a "generalized" character set for identifiers (which appears to be desirable for merging the eastern and western community and to avoid further forking), we should avoid using code pages.

I am not very happy if the symbol file format has to be changed, but if it cannot be avoided then this should be done clearly and openly and not "through the back door".

I am currently a bit disoriented about the current state of the issue/feature request.

The link posted by Ivan

Ivan Denisov wrote: I have just started a new project from scratch in Redmine and a new repository on GitHub.

for the "new project" does not work for me and I cannot find what changed in the second one.

--
Bernhard

Ivan Denisov · Post by **Ivan Denisov** » Mon Sep 22, 2014 11:31 pm

bernhard wrote: The link posted by Ivan
Ivan Denisov wrote: I have just started a new project from scratch in Redmine and a new repository on GitHub.
for the "new project" does not work for me and I cannot find what changed in the second one.

I am sorry, Bernhard, after the discussion with Josef the link was changed to:
http://redmine.blackboxframework.org/projects/blackbox
I forgot to fix it.

However, Helmut's solution is not yet in the Center repository. You can evaluate it in Zinn's latest version of BlackBox:
http://www.zinnamturm.eu/downloads.htm

Josef Templ · Post by **Josef Templ** » Tue Sep 23, 2014 8:20 am

If I understand Helmut's latest solution correctly, the idea is to
support full Unicode identifiers in the compiler while still using ARRAY OF SHORTCHAR
internally for representing identifiers. This means that such an ARRAY OF SHORTCHAR
is not what you would usually expect from an ARRAY OF SHORTCHAR but it is a UTF-8
encoded ARRAY OF BYTE that is stored within an ARRAY OF SHORTCHAR!

You can see this 'trick' when you insert a log message somewhere in the compiler
that outputs an identifier, or when you generate a TRAP and look at
strings that contain identifier names in the TRAP window.
Identifiers that contain extended ASCII characters are not displayed properly.
OK, this can be seen as a minor issue as long as you use ASCII characters only.
But in addition, this approach needs a conversion from Unicode to UTF-8 for all identifiers
in DevCPS.Identifier. OK, it seems to work, but is it an elegant solution?
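
The display problem can be reproduced outside BlackBox. The following is a minimal sketch in Python (not BlackBox code), assuming that a SHORTCHAR string is shown by treating each stored byte as one Latin-1 character. The variable names are made up for the illustration.

```python
# Illustration only: Python bytes stand in for an ARRAY OF SHORTCHAR buffer.
ident = "Я"                          # one Unicode character, U+042F
utf8_bytes = ident.encode("utf-8")   # what the UTF-8 approach stores: 0xD0 0xAF

# Assumption: a viewer that expects plain SHORTCHAR text reads every byte
# as one Latin-1 character, so the single letter turns into two others.
shown = utf8_bytes.decode("latin-1")

# Pure ASCII identifiers are unaffected, because UTF-8 and Latin-1 agree
# on the range 0..127.
ascii_ident = "TestMaxIdLen"
ascii_shown = ascii_ident.encode("utf-8").decode("latin-1")

print(len(ident), len(utf8_bytes), len(shown))
print(ascii_shown == ascii_ident)
```

The single letter is stored as two bytes and comes back as two unrelated characters, while the ASCII name survives unchanged, which matches the observation that the issue stays invisible as long as only ASCII is used.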

If we want to support full Unicode identifiers, the obvious question is why not use ARRAY OF CHAR
instead of ARRAY OF SHORTCHAR in the compiler for representing identifiers?
I have not looked at the consequences of such an approach in detail yet,
but it seems to me to be a less 'tricky' approach.
The (expected) advantage would be that the compiler shows you all the places that need a conversion.
When writing symbol files, identifiers can still be converted to UTF-8 format in order to save space
for plain ASCII characters.
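
To make the space argument concrete, here is a small Python sketch (not BlackBox code) that compares the size of a name stored as 16-bit characters with the same name in UTF-8. It assumes CHAR is a 16-bit unit, modelled here with UTF-16 little-endian; the helper names are hypothetical.

```python
def utf8_size(name: str) -> int:
    """Number of bytes the name needs when written as UTF-8."""
    return len(name.encode("utf-8"))

def char16_size(name: str) -> int:
    """Number of bytes the name needs as 16-bit units (no byte order mark)."""
    return len(name.encode("utf-16-le"))

for name in ("TestMaxIdLen", "π", "€"):
    print(name, utf8_size(name), char16_size(name))
```

For plain ASCII names UTF-8 needs half the space. For a character such as π the two sizes are equal, and for a character such as € UTF-8 is larger, so the saving applies to ASCII-dominated symbol files only.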

The fact that the latest approach is hard to get right can be seen from the following test program,
which the compiler accepts although it should not. The bug is related to checking the
maximum allowed identifier length, which varies for UTF-8 encoded strings.

MODULE TestMaxIdLen;

CONST
πππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππ = 0;

x =
πππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππ_ignoredPostfix;

END TestMaxIdLen.
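
One way such a length check can go wrong is sketched below in Python (not BlackBox code). It is an assumption about the mechanism, not an analysis of the actual compiler source: the limit of 16 bytes and both function names are invented for the illustration.

```python
# Hypothetical limit for the illustration; not the value BlackBox uses.
MAX_ID_BYTES = 16

def store_identifier(name: str, limit: int = MAX_ID_BYTES) -> bytes:
    """Encode to UTF-8 and silently cut at the byte limit (faulty variant)."""
    return name.encode("utf-8")[:limit]

def check_identifier(name: str, limit: int = MAX_ID_BYTES) -> bytes:
    """Encode to UTF-8 and reject names whose encoded form is too long."""
    data = name.encode("utf-8")
    if len(data) > limit:
        raise ValueError("identifier too long: %d bytes" % len(data))
    return data

short_name = "π" * 8                        # 8 characters, 16 bytes in UTF-8
long_name = short_name + "_ignoredPostfix"  # a different identifier

# With silent truncation both names end up as the same stored bytes,
# so the second name would resolve to the first one.
print(store_identifier(short_name) == store_identifier(long_name))
```

A check that counts characters instead of encoded bytes, or that truncates instead of reporting an error, lets the longer name pass; `check_identifier` shows the stricter alternative that measures the encoded length.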

To summarize, I don't want to say that the solution is bad. Actually, I am impressed that it works so well.
But it is tricky and may need some testing and thinking before a decision.

- Josef

DGDanforth · Post by **DGDanforth** » Tue Sep 23, 2014 10:02 am

If identifiers are encoded with UTF-8 then there needs to be a wrapper around such
sequences of "characters" that knows how to deal with the UTF-8 encoding.
How is that done when the number of bytes for a character exceeds 1?

Ivan Denisov · Post by **Ivan Denisov** » Tue Sep 23, 2014 10:27 am

DGDanforth wrote: If identifiers are encoded with UTF-8 then there needs to be a wrapper around such
sequences of "characters" that knows how to deal with the UTF-8 encoding.
How is that done when the number of bytes for a character exceeds 1?

I think Josef gave the right explanation and answer to your question:

Josef Templ wrote:UTF-8 encoded ARRAY OF BYTE that is stored within an ARRAY OF SHORTCHAR
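
For readers wondering how multi-byte characters are found inside such a byte sequence, here is a minimal Python sketch (not BlackBox code, and not Helmut's implementation). It relies only on the standard UTF-8 rule that the first byte of each character announces its length; the function name is hypothetical.

```python
def utf8_chars(data: bytes):
    """Yield the characters of a UTF-8 byte sequence one at a time."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: single-byte (ASCII)
            n = 1
        elif lead >> 5 == 0b110:     # 110xxxxx: two-byte character
            n = 2
        elif lead >> 4 == 0b1110:    # 1110xxxx: three-byte character
            n = 3
        elif lead >> 3 == 0b11110:   # 11110xxx: four-byte character
            n = 4
        else:
            raise ValueError("unexpected continuation byte at index %d" % i)
        yield data[i:i + n].decode("utf-8")
        i += n

stored = "aЯπ".encode("utf-8")       # 5 bytes for 3 characters
print(len(stored), list(utf8_chars(stored)))
```

Code that only copies or compares identifiers can treat the bytes as opaque, which is why the approach works without a wrapper in most places; only code that counts, displays or truncates characters needs logic like the above.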
