issue-#9: adding module Characters → #19

Merged to the master branch
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Feature #9: adding module Characters

Post by Ivan Denisov »

I tested Helmut's implementation of the new universal solution for Unicode identifiers. In general, everything works very well. We should think about how to include this in the issue tracker and add it to the development branch.

I have just started a new project from scratch in Redmine and a new repository on GitHub.
Zinn
Posts: 476
Joined: Tue Mar 25, 2014 5:56 pm
Location: Frankfurt am Main

Re: Feature #9: adding module Characters

Post by Zinn »

It will be difficult to add all those changes to Redmine, because there are a lot of modules affected.
If the changes are split into several steps (groups), the intermediate results do not work with UTF-8.
Only with code pages does each step work correctly on its own.

The groups could be:
1. Kernel & Strings (75.)
2. Centralise some Unicode character functions (21.)
3. Letter Я correction (71.)
4. Insert Strings.ToShort call for SHORT (72. = 22.)
5. Insert Strings.ToLong call for LONG (73.)
6. Insert Strings.ToLong call before Kernel.SplitName (74.)
7. Change from CodePage to Utf8 (76.)

Steps 4, 5 and 6 are critical. They must all be working before you use UTF-8.
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Feature #9: adding module Characters

Post by Josef Templ »

This is a complex issue.
Ideally, it should be discussed before even starting an issue.

The basic question is: do we want to change the symbol file format?
If YES, this allows completely different solutions than staying with the current
symbol file format.

Helmut's latest solution is based on changing the symbol file format (through the back door).
As far as I have tested it, the solution works, but it needs a lot of changes in a lot of modules.
With that number of changes there may always be hidden bugs; nobody knows.
The code page approach had less impact on the sources.

- Josef
Zinn
Posts: 476
Joined: Tue Mar 25, 2014 5:56 pm
Location: Frankfurt am Main

Re: Feature #9: adding module Characters

Post by Zinn »

Josef, I agree, it should be discussed before even starting.

The advantages of the code page version are:
- More fault-tolerant.
- Easier to debug.
- Downward compatible.

The disadvantages are:
- Code pages are obsolete.
- Windows will not have code pages in the future.

If you would like to go this way,
the number of modules to change is the same.
The steps are also the same.
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Feature #9: adding module Characters

Post by Ivan Denisov »

We should not use code pages.
Helmut's solution works well and it supports all the cases of Unicode identifiers.
The solution has a few bugs, but they are small (for example, the log message is wrong when modules are unloaded).
I think we need to include this solution as soon as possible. If it touches many files, that is not a problem from my point of view.
Bernhard
Posts: 68
Joined: Tue Sep 17, 2013 6:56 am
Location: Munich, Germany

Re: Feature #9: adding module Characters

Post by Bernhard »

After having tried to digest the discussion so far, I'd like to contribute my opinion:

If we really want to include a "generalized" character set for identifiers (which appears to be desirable for merging the eastern and western communities and to avoid further forking), we should avoid using code pages.

I am not very happy when the symbol file format has to be changed, but if it cannot be avoided then this should be done clearly and openly and not "through the back door".

I am a bit disoriented about the current state of the issue/feature request.

The link posted by Ivan
Ivan Denisov wrote: I have just started new project from the scratch in Redmine and new Repository in GitHub.
for the "new Project" does not work for me and I cannot find what changed in the second one.

--
Bernhard
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Feature #9: adding module Characters

Post by Ivan Denisov »

bernhard wrote:The link posted by Ivan
Ivan Denisov wrote: I have just started new project from the scratch in Redmine and new Repository in GitHub.
for the "new Project" does not work for me and I cannot find what changed in the second one.
I am sorry, Bernhard. After the discussion with Josef, the link was changed to:
http://redmine.blackboxframework.org/projects/blackbox
I forgot to fix it.

However, Helmut's solution is not yet in the Center repository. You can evaluate it in Zinn's latest version of BlackBox:
http://www.zinnamturm.eu/downloads.htm
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Feature #9: adding module Characters

Post by Josef Templ »

If I understand Helmut's latest solution correctly, the idea is to
support full Unicode identifiers in the compiler while still using ARRAY OF SHORTCHAR
internally for representing identifiers. This means that such an ARRAY OF SHORTCHAR
is not what you would usually expect from an ARRAY OF SHORTCHAR but it is a UTF-8
encoded ARRAY OF BYTE that is stored within an ARRAY OF SHORTCHAR!

You can see this 'trick' when you insert a log message somewhere in the compiler
that outputs an identifier, or when you generate a TRAP and look at
strings that contain identifier names in the TRAP window.
Identifiers that contain extended ASCII characters are not displayed properly.
OK, this can be seen as a minor issue as long as you use ASCII characters only.
But in addition, this approach needs a conversion from Unicode to UTF-8 for all identifiers
in DevCPS.Identifier. OK, it seems to work, but is it an elegant solution?
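
To illustrate the display problem, here is a small sketch in Python (used only for illustration, it is not BlackBox code). It assumes that a SHORTCHAR behaves like one Latin-1 character, so Python's latin-1 codec stands in for it; the function names are made up.

```python
# Sketch only: the latin-1 codec stands in for SHORTCHAR (one byte per character).

def store_as_shortchars(identifier: str) -> str:
    """Encode to UTF-8, then view every byte as one 8-bit character."""
    return identifier.encode("utf-8").decode("latin-1")

def recover_identifier(shortchars: str) -> str:
    """Reverse the trick: get the bytes back, then decode them as UTF-8."""
    return shortchars.encode("latin-1").decode("utf-8")

name = "Я1"
stored = store_as_shortchars(name)
print(len(name), len(stored))      # 2 characters become 3 SHORTCHARs
print(stored)                      # the garbled form a log or TRAP window would show
print(recover_identifier(stored))  # the original name is still recoverable
```

The stored form is lossless, but anything that prints it as ordinary 8-bit text shows the raw UTF-8 bytes instead of the identifier.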

If we want to support full Unicode identifiers, the obvious question is: why not use ARRAY OF CHAR
instead of ARRAY OF SHORTCHAR in the compiler for representing identifiers?
I have not looked at the consequences of such an approach in detail yet,
but it seems to me to be a less 'tricky' approach.
The (expected) advantage would be that the compiler shows you all the places that need a conversion.
When writing symbol files, identifiers can still be converted to UTF-8 format in order to save space
for plain ASCII characters.
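
The space argument can be checked with a rough Python sketch. It assumes 2 bytes per CHAR and only characters from the Basic Multilingual Plane; the helper names are hypothetical and nothing here reflects the actual symbol file layout.

```python
def size_as_char_array(identifier: str) -> int:
    """Bytes needed with one 16-bit CHAR per character (BMP only, no terminator)."""
    return 2 * len(identifier)

def size_as_utf8(identifier: str) -> int:
    """Bytes needed when the same identifier is written as UTF-8."""
    return len(identifier.encode("utf-8"))

for name in ("Strings", "Größe", "ππ"):
    print(name, size_as_char_array(name), size_as_utf8(name))
```

Plain ASCII names take half the space in UTF-8, while names made of Greek or Cyrillic letters take about the same space in both forms.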

The fact that the latest approach is hard to get right can be seen from the following test program,
which is accepted by the compiler although it should not be. The bug is related to checking the
maximum allowed identifier length, which varies for UTF-8 encoded strings.

MODULE TestMaxIdLen;

CONST
πππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππ = 0;

x =
πππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππ_ignoredPostfix;

END TestMaxIdLen.
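
A minimal Python sketch of how such a bug can arise. This is only an assumption about the mechanism, not the actual compiler code: the limit of 8 bytes and the function name are hypothetical and chosen small for readability.

```python
MAX_ID_BYTES = 8  # hypothetical buffer limit, NOT the real compiler constant

def store_identifier(identifier: str) -> bytes:
    """Hypothetical buggy scanner: silently stops once the UTF-8 buffer is full."""
    buf = bytearray()
    for ch in identifier:
        encoded = ch.encode("utf-8")
        if len(buf) + len(encoded) > MAX_ID_BYTES:
            break  # bug: an "identifier too long" error should be reported here
        buf += encoded
    return bytes(buf)

long_name = "ππππ"                           # 4 characters, but already 8 bytes
longer_name = long_name + "_ignoredPostfix"  # a different identifier in the source
print(store_identifier(long_name) == store_identifier(longer_name))  # the two names collide
```

Because each π takes two bytes, the byte limit is reached after half as many characters as with ASCII letters, and everything after that point is dropped without an error.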

To summarize, I don't want to say that the solution is bad. Actually, I am impressed that it works so well.
But it is tricky and may need some testing and thinking before a decision.

- Josef
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Feature #9: adding module Characters

Post by DGDanforth »

If identifiers are encoded with UTF-8 then there needs to be a wrapper around such
sequences of "characters" that knows how to deal with the UTF-8 encoding.
How is that done when the number of bytes for a character exceeds 1?
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Feature #9: adding module Characters

Post by Ivan Denisov »

DGDanforth wrote:If identifiers are encoded with UTF-8 then there needs to be a wrapper around such
sequences of "characters" that knows how to deal with the UTF-8 encoding.
How is that done when the number of bytes for a character exceeds 1?
I think Josef gave the right explanation and the answer to your question:
Josef Templ wrote:UTF-8 encoded ARRAY OF BYTE that is stored within an ARRAY OF SHORTCHAR
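
On the multi-byte part of the question: in UTF-8 the first byte of each sequence tells how many bytes belong to the character, so no extra wrapper data is needed. A minimal Python sketch (for illustration only, the function names are made up and not part of the BlackBox sources; it does not reject overlong forms or surrogates):

```python
def sequence_length(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, derived from its first byte."""
    if lead < 0x80:
        return 1              # 0xxxxxxx: plain ASCII
    if lead >> 5 == 0b110:
        return 2              # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3              # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4              # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")

def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 by hand, one sequence at a time."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        tail = data[i + 1:i + n]
        if len(tail) != n - 1:
            raise ValueError("truncated sequence")
        # Payload bits of the lead byte: all 7 for ASCII, otherwise 7 - n bits.
        code = data[i] if n == 1 else data[i] & (0xFF >> (n + 1))
        for b in tail:
            if b >> 6 != 0b10:
                raise ValueError("continuation byte expected")
            code = (code << 6) | (b & 0x3F)  # each continuation byte adds 6 bits
        chars.append(chr(code))
        i += n
    return "".join(chars)

print(decode_utf8("Яπx".encode("utf-8")))
```

Any code that counts, truncates or indexes identifier characters has to step through the bytes this way instead of assuming one byte per character.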