issue-#9: adding module Characters → #19
- Ivan Denisov
- Posts: 1700
- Joined: Tue Sep 17, 2013 12:21 am
- Location: Russia
Re: Feature #9: adding module Characters
I tested Helmut's implementation of the new universal solution for Unicode identifiers. Generally, everything works very well. We should think about how to include this in the issue tracker and add it to the development branch.
I have just started a new project from scratch in Redmine and a new repository on GitHub.
Re: Feature #9: adding module Characters
It will be difficult to add all those changes to Redmine, because many modules are affected.
If the changes are split into several steps (groups), the intermediate results do not work with UTF-8.
Only with code pages do the individual steps work correctly.
The groups could be:
1. Kernel & Strings (75.)
2. Centralise some Unicode character functions (21.)
3. Letter Я correction (71.)
4. Insert Strings.ToShort call for SHORT (72. = 22.)
5. Insert Strings.ToLong call for LONG (73.)
6. Insert Strings.ToLong call before Kernel.SplitName (74.)
7. Change from CodePage to Utf8 (76.)
Steps 4, 5 and 6 are critical. They must all be working before you use UTF-8.
- Josef Templ
- Posts: 2047
- Joined: Tue Sep 17, 2013 6:50 am
Re: Feature #9: adding module Characters
This is a complex issue.
Ideally, it should be discussed before even starting an issue.
The basic question is: do we want to change the symbol file format?
If YES, this allows completely different solutions than staying with the current
symbol file format.
Helmut's latest solution is based on changing the symbol file format (through the back door).
As far as I have tested it, the solution works, but it needs a lot of changes in a lot of modules.
With that number of changes, there may always be hidden bugs; nobody knows.
The code page approach had less impact on the sources.
- Josef
Re: Feature #9: adding module Characters
Josef, I agree, it should be discussed before even starting.
The advantages of the code page version are:
- More fault-tolerant.
- Easier to debug.
- Downward compatible.
The disadvantages are:
- Code pages are obsolete.
- Windows will not have code pages in the future.
If you would like to go this way,
the number of modules to change is no different.
The steps are also the same.
- Ivan Denisov
- Posts: 1700
- Joined: Tue Sep 17, 2013 12:21 am
- Location: Russia
Re: Feature #9: adding module Characters
We should not use code pages.
Helmut's solution works well and it supports all the cases of Unicode identifiers.
The solution has a few bugs, but they are small (for example, when modules are unloaded, the log message is wrong).
I think that we need to include this solution as soon as possible. If it touches many files, that is not a problem from my point of view.
Re: Feature #9: adding module Characters
After having tried to digest the discussion so far, I'd like to contribute my opinion:
If we really want to include a "generalized" character set for identifiers (which appears to be desirable for merging the eastern and western communities and to avoid further forking), we should avoid using code pages.
I am not very happy when the symbol file format has to be changed, but if it cannot be avoided then this should be done clearly and openly, and not "through the back door".
I am currently a bit disoriented about the current state of the issue/feature request.
Ivan Denisov wrote: I have just started a new project from scratch in Redmine and a new repository on GitHub.
The link posted by Ivan for the "new Project" does not work for me and I cannot find what changed in the second one.
--
Bernhard
- Ivan Denisov
- Posts: 1700
- Joined: Tue Sep 17, 2013 12:21 am
- Location: Russia
Re: Feature #9: adding module Characters
Ivan Denisov wrote: I have just started a new project from scratch in Redmine and a new repository on GitHub.
bernhard wrote: The link posted by Ivan for the "new Project" does not work for me and I cannot find what changed in the second one.
I am sorry, Bernhard. After the discussion with Josef, the link was changed to:
http://redmine.blackboxframework.org/projects/blackbox
I forgot to fix it.
However, Helmut's solution is not yet in the Center repository. You can evaluate it in Zinn's latest version of BlackBox:
http://www.zinnamturm.eu/downloads.htm
- Josef Templ
- Posts: 2047
- Joined: Tue Sep 17, 2013 6:50 am
Re: Feature #9: adding module Characters
If I understand Helmut's latest solution correctly, the idea is to
support full Unicode identifiers in the compiler while still using ARRAY OF SHORTCHAR
internally for representing identifiers. This means that such an ARRAY OF SHORTCHAR
is not what you would usually expect from an ARRAY OF SHORTCHAR but it is a UTF-8
encoded ARRAY OF BYTE that is stored within an ARRAY OF SHORTCHAR!
You can see this 'trick' when you insert a log message somewhere in the compiler
that outputs an identifier, or you can also see it when you generate a TRAP and look at
strings that contain identifier names in the TRAP window.
Identifiers that contain extended ASCII characters are not displayed properly.
OK, this can be seen as a minor issue as long as you use ASCII characters only.
But in addition, this approach needs a conversion from Unicode to UTF-8 for all identifiers
in DevCPS.Identifier. OK, it seems to work, but is it an elegant solution?
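The display problem described above can be reproduced outside BlackBox. The following Python sketch is only an illustration, not the compiler's code: it assumes that a SHORTCHAR holds one 8-bit Latin-1 character, so decoding as `latin-1` stands in for printing an ARRAY OF SHORTCHAR, and the identifier names are made up.

```python
# Illustration only: Latin-1 decoding stands in for displaying an
# ARRAY OF SHORTCHAR one byte at a time (an assumption of this sketch).

def show_as_shortchars(identifier: str) -> str:
    """Encode an identifier as UTF-8, then display each byte as one 8-bit character."""
    utf8_bytes = identifier.encode("utf-8")
    return utf8_bytes.decode("latin-1")

for name in ["counter", "größe", "π"]:  # hypothetical identifiers
    print(f"{name!r} is displayed as {show_as_shortchars(name)!r}")
```

Pure ASCII names come out unchanged, while every non-ASCII character turns into two or more unrelated 8-bit characters.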
If we want to support full Unicode identifiers, the obvious question is: why not use ARRAY OF CHAR
instead of ARRAY OF SHORTCHAR in the compiler for representing identifiers?
I have not looked at the consequences of such an approach in detail yet,
but it seems to me to be a less 'tricky' approach.
The (expected) advantage would be that the compiler shows you all the places that need a conversion.
When writing symbol files, identifiers can still be converted to UTF-8 format in order to save space
for plain ASCII characters.
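The space argument can be checked with a few lines of Python. This is a rough sketch: it assumes that a CHAR occupies two bytes (compared here with UTF-16), `encoded_sizes` is a hypothetical helper, and the sample identifiers are invented.

```python
def encoded_sizes(identifier: str) -> tuple:
    """Return (size in bytes as UTF-8, size in bytes as UTF-16) for an identifier."""
    utf8_size = len(identifier.encode("utf-8"))
    utf16_size = len(identifier.encode("utf-16-le"))  # 2 bytes per unit, no byte order mark
    return utf8_size, utf16_size

for name in ["SplitName", "Яблоко", "π"]:  # hypothetical identifiers
    utf8_size, utf16_size = encoded_sizes(name)
    print(f"{name}: {utf8_size} bytes as UTF-8, {utf16_size} bytes as UTF-16")
```

For plain ASCII names the UTF-8 form is half the size; for the Cyrillic and Greek samples both forms take the same space.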
That the latest approach is hard to get right can be seen from the following test program,
which is accepted by the compiler but should not be. The bug is related to checking the
maximum allowed identifier length, which varies for UTF-8 encoded strings.
MODULE TestMaxIdLen;
CONST
πππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππ = 0;
x =
πππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππππ_ignoredPostfix;
END TestMaxIdLen.
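Why the length check is easy to get wrong can be sketched in Python. The limit of 256 below is an assumed value for illustration only, and `fits` is a hypothetical helper, not the compiler's check. The point is only that a count of characters and a count of UTF-8 bytes give different answers for the same identifier.

```python
MAX_ID_LEN = 256  # assumed limit, for illustration only

def fits(identifier: str, count_bytes: bool) -> bool:
    """Check an identifier against MAX_ID_LEN, counting UTF-8 bytes or characters."""
    if count_bytes:
        length = len(identifier.encode("utf-8"))
    else:
        length = len(identifier)
    return length < MAX_ID_LEN

ascii_id = "a" * 200
greek_id = "π" * 200  # 200 characters, but 400 bytes in UTF-8

print(fits(ascii_id, count_bytes=False), fits(ascii_id, count_bytes=True))
print(fits(greek_id, count_bytes=False), fits(greek_id, count_bytes=True))
```

The ASCII name passes either way, while the Greek name passes a character-based check and fails a byte-based one, so the check has to use the same unit as the buffer that stores the name.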
To summarize, I don't want to say that the solution is bad. Actually, I am impressed that it works so well.
But it is tricky and may need some testing and thinking before a decision.
- Josef
- DGDanforth
- Posts: 1061
- Joined: Tue Sep 17, 2013 1:16 am
- Location: Palo Alto, California, USA
Re: Feature #9: adding module Characters
If identifiers are encoded with UTF-8 then there needs to be a wrapper around such
sequences of "characters" that knows how to deal with the UTF-8 encoding.
How is that done when the number of bytes for a character exceeds 1?
- Ivan Denisov
- Posts: 1700
- Joined: Tue Sep 17, 2013 12:21 am
- Location: Russia
Re: Feature #9: adding module Characters
DGDanforth wrote: If identifiers are encoded with UTF-8 then there needs to be a wrapper around such
sequences of "characters" that knows how to deal with the UTF-8 encoding.
How is that done when the number of bytes for a character exceeds 1?
I think that Josef gave the right explanation and answer to your question:
Josef Templ wrote: UTF-8 encoded ARRAY OF BYTE that is stored within an ARRAY OF SHORTCHAR
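On the mechanics of multi-byte characters: in UTF-8 the first byte of each character announces how many bytes belong to it, so a wrapper can walk the byte sequence character by character. The Python sketch below illustrates this general property of UTF-8. It is not taken from Helmut's implementation, the function names are made up, and it does not validate malformed input.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")

def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per character (no validation)."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars

encoded = "aЯπ€".encode("utf-8")
print([len(c) for c in split_characters(encoded)])
print([c.decode("utf-8") for c in split_characters(encoded)])
```

Code that only copies or compares identifiers can treat the bytes as opaque; only operations that count or display characters need this kind of walk.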