issue-#19: Unicode for Component Pascal identifiers
- Josef Templ
- Posts: 2047
- Joined: Tue Sep 17, 2013 6:50 am
issue-#19: Unicode for Component Pascal identifiers
I have created redmine issue-#19 for "Adding full Unicode support for Component Pascal identifiers"
and a topic branch. See http://redmine.blackboxframework.org/issues/19
and https://github.com/BlackBoxCenter/black ... ssue-%2319.
Discussion of this issue started under http://forum.blackboxframework.org/view ... f=41&t=109
using a different approach, which is now obsolete. Discussion should continue in this thread.
I will try to collect the related changes by next week.
- Josef
- DGDanforth
- Posts: 1061
- Joined: Tue Sep 17, 2013 1:16 am
- Location: Palo Alto, California, USA
- Contact:
Re: Issue #19: Unicode for Component Pascal identifiers
From the CP language report
"Every identifier occurring in a program must be introduced by a declaration, unless it is a predeclared identifier. Declarations also specify certain permanent properties of an object, such as whether it is a constant, a type, a variable, or a procedure. The identifier is then used to refer to the associated object."
(emphasis added)
I write that simply to make sure I understand where Unicode will be used.
The report also says "Unicode (16 bit) characters are allowed in string constants only."
I now ask, "What about comments?" Surely Russian speakers would want to be able to comment
in Russian, or is that already handled by the compiler by simply ignoring comments?
-
- Posts: 1700
- Joined: Tue Sep 17, 2013 12:21 am
- Location: Russia
Re: Issue #19: Unicode for Component Pascal identifiers
DGDanforth wrote: I now ask "What about comments"? Surely Russian speakers would want to be able to comment
in Russian, or is that already handled by the compiler by simply ignoring comments?
Yes, it is handled already. No problems with comments in Unicode.
- DGDanforth
- Posts: 1061
- Joined: Tue Sep 17, 2013 1:16 am
- Location: Palo Alto, California, USA
- Contact:
Re: Issue #19: Unicode for Component Pascal identifiers
Josef,
There seems to be a conflict between "Adding full Unicode support for Component Pascal identifiers" and the limitation to 16 bits for the editor. It takes 17 bits to encompass the current Unicode range.
Internally, I understand how SHORTCHAR strings can be used for UTF-8 and hence all of Unicode, but how, with the editor, do you get such strings into a document that only supports 16-bit characters?
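For reference on the bit widths in question: the Unicode code space runs up to U+10FFFF (17 planes of 65536 code points each), which fits in 21 bits, while the 16-bit range covers only the first plane. The sketch below is in Python purely for illustration; the helper name and the sample character are assumptions, not part of BlackBox.

```python
# Illustrative only: bit widths of the Unicode code space versus a 16-bit CHAR.
MAX_CODE_POINT = 0x10FFFF   # highest code point defined by Unicode
MAX_16_BIT = 0xFFFF         # highest value a 16-bit character can hold


def bits_needed(code_point: int) -> int:
    """Number of bits required to store the given code point."""
    return max(1, code_point.bit_length())


# A character outside the 16-bit range (sample chosen arbitrarily).
sample = chr(0x1F600)
utf16_units = len(sample.encode("utf-16-le")) // 2  # 16-bit units in UTF-16
utf8_bytes = len(sample.encode("utf-8"))            # bytes in UTF-8

print(bits_needed(MAX_CODE_POINT), bits_needed(MAX_16_BIT))
print(utf16_units, utf8_bytes)
```

A character beyond the 16-bit range needs a surrogate pair (two 16-bit units) in UTF-16 and four bytes in UTF-8.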
-
- Posts: 1700
- Joined: Tue Sep 17, 2013 12:21 am
- Location: Russia
Re: Issue #19: Unicode for Component Pascal identifiers
Doug is right; we should rename the issue to "Adding 16-bit Unicode support for Component Pascal identifiers".
- Josef Templ
- Posts: 2047
- Joined: Tue Sep 17, 2013 6:50 am
Re: Issue #19: Unicode for Component Pascal identifiers
A list of changes can be found here:
http://redmine.blackboxframework.org/pr ... 83e3294a69
Have fun! It is a long list.
Let me know if there are any files missing.
- Josef
-
- Posts: 1700
- Joined: Tue Sep 17, 2013 12:21 am
- Location: Russia
Re: Issue #19: Unicode for Component Pascal identifiers
Josef, I updated Script/Mod/DevCPT.odc and the build pipeline has just built a new BlackBox:
blackbox-1.7-a1.020.zip
blackbox-1.7-a1.020-setup.exe
The differences are listed here.
I downloaded it and tried some examples with Cyrillic. Everything seems to work fine!
- DGDanforth
- Posts: 1061
- Joined: Tue Sep 17, 2013 1:16 am
- Location: Palo Alto, California, USA
- Contact:
Re: Issue #19: Unicode for Component Pascal identifiers
It appears that the TYPE of a character is effectively changed from SHORTCHAR to CHAR.
That is not what I expected. I had expected that a new TYPE Utf8 would be used, where Utf8 is an array of SHORTCHAR. The array length of a variable of type Utf8 would then vary: 1, 2, or 3. Most of the time it would use a single byte. Perhaps there would be 3 predefined types Utf8_1, Utf8_2, Utf8_3 so that the overhead of a pointer to Utf8 would be avoided.
But then again, as Josef has said, this code is only used for the programming aspects of BlackBox, and 16 bits is sufficient in that case.
- Josef Templ
- Posts: 2047
- Joined: Tue Sep 17, 2013 6:50 am
Re: Issue #19: Unicode for Component Pascal identifiers
Doug, in a BlackBox 1.6 text document the characters are encoded using 16-bit Unicode.
With the issue-#19 changes the CP compiler reads the Unicode and, when it builds up
an identifier, it uses the UTF-8 encoding internally for representing the name of that identifier.
This is compact for identifiers that consist mostly of ASCII characters
and identical to the 1.6 version for identifiers that consist only of ASCII characters.
The UTF-8 format is then also used without any further conversion
for externalizing data to the symbol or object file.
Thanks to Ivan for replicating the DevCPM changes to the scripting engine.
This shows that the redundancy in the subsystem Script (slightly modified copies of
DevCPM and HostFiles) is a potential source of inconsistencies and should be
eliminated sooner rather than later.
- Josef
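A minimal sketch of the encoding step described above, written in Python purely for illustration. The actual change lives in the Component Pascal compiler modules, and the function names here are hypothetical. Each 16-bit character of an identifier becomes one to three UTF-8 bytes, so a pure-ASCII name keeps its single-byte form.

```python
def encode_char16(code: int) -> bytes:
    """Encode one 16-bit character code as UTF-8 (1 to 3 bytes).

    Surrogate code units (0xD800..0xDFFF) are not treated specially
    in this sketch.
    """
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit character code")
    if code < 0x80:
        # ASCII: stored unchanged in a single byte.
        return bytes([code])
    if code < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])


def encode_identifier(name: str) -> bytes:
    """Encode an identifier made of 16-bit characters as UTF-8 bytes."""
    return b"".join(encode_char16(ord(ch)) for ch in name)


ascii_name = encode_identifier("Count")    # hypothetical ASCII identifier
cyrillic_name = encode_identifier("Счёт")  # hypothetical Cyrillic identifier
print(ascii_name, len(cyrillic_name))
```

The ASCII identifier comes out byte-for-byte as it would in a one-byte-per-character representation, while each Cyrillic letter in the second name takes two bytes.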
- DGDanforth
- Posts: 1061
- Joined: Tue Sep 17, 2013 1:16 am
- Location: Palo Alto, California, USA
- Contact:
Re: Issue #19: Unicode for Component Pascal identifiers
Summary(?)
(1.6) 2-byte text input => 1-byte output
(1.7) 2-byte text input => UTF-8 (variable number of bytes) output
When the 2-byte text is ASCII, the two (1.6, 1.7) output forms are identical.
I still have a problem. If the high bit {15} of the input 16-bit Unicode is set, then the resultant UTF-8 will be 3 bytes.
That does not correspond to Helmut's comments that he uses only 16 bits.
Does that mean that input 16-bit Unicode is actually restricted to 15 bits?
I am assuming that the UTF-8 rule states that the high bit of a byte is a 'continuation' bit such that, if set, another byte follows.
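For reference, standard UTF-8 does not use the high bit as a simple "another byte follows" flag. The first byte announces the total length (0xxxxxxx, 110xxxxx, or 1110xxxx for values up to 16 bits) and every following byte has the form 10xxxxxx. Every value from 0x0800 to 0xFFFF therefore takes three bytes, whether or not bit 15 is set, so the full 16-bit range remains usable. A small Python check, illustrative only, with made-up helper names:

```python
def utf8_length(code: int) -> int:
    """Number of UTF-8 bytes needed for a 16-bit character code."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit character code")
    if code < 0x80:
        return 1
    if code < 0x800:
        return 2
    return 3


def byte_patterns(code: int) -> list[str]:
    """Bit patterns of the UTF-8 bytes, using Python's built-in encoder."""
    return [format(b, "08b") for b in chr(code).encode("utf-8")]


# Boundary values on both sides of each length change; surrogates are avoided
# because Python's encoder rejects them.
SAMPLES = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0x7FFF, 0x8000, 0xFFFF]

for code in SAMPLES:
    patterns = " ".join(byte_patterns(code))
    print(f"U+{code:04X}: {utf8_length(code)} byte(s)  {patterns}")
```

Both 0x7FFF (bit 15 clear) and 0x8000 (bit 15 set) need three bytes; the length depends on the value range, not on a single bit of the input.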