issue-#19: Unicode for Component Pascal identifiers

Merged to the master branch
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

issue-#19: Unicode for Component Pascal identifiers

Post by Josef Templ »

I have created redmine issue-#19 for "Adding full Unicode support for Component Pascal identifiers"
and a topic branch. See http://redmine.blackboxframework.org/issues/19
and https://github.com/BlackBoxCenter/black ... ssue-%2319.

Discussion of this issue was started under http://forum.blackboxframework.org/view ... f=41&t=109
using a different approach, which is now obsolete. Discussion should continue in this thread.

I will try to collect the related changes until next week.

- Josef
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

From the CP language report

"Every identifier occurring in a program must be introduced by a declaration, unless it is a predeclared identifier. Declarations also specify certain permanent properties of an object, such as whether it is a constant, a type, a variable, or a procedure. The identifier is then used to refer to the associated object."

(emphasis added)
I write that simply to make sure I understand where Unicode will be used.

The report also says "Unicode (16 bit) characters are allowed in string constants only."

I now ask "What about comments"? Surely Russian speakers would want to be able to comment
in Russian, or is that already handled by the compiler by simply ignoring comments?
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

DGDanforth wrote: I now ask "What about comments"? Surely Russian speakers would want to be able to comment
in Russian, or is that already handled by the compiler by simply ignoring comments?
Yes, it is handled already. No problems with comments in Unicode.
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Josef,
There seems to be a conflict between "Adding full Unicode support for Component Pascal identifiers" and the editor's limitation to 16 bits. It takes 21 bits to encompass the current Unicode range (code points up to U+10FFFF).

Internally, I understand how SHORTCHAR strings can be used for UTF-8 and hence for all of Unicode. But how, with the editor, do you get such strings into a document that only supports 16-bit characters?
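
To make the size question concrete, here is a small Python sketch (not BlackBox code; the helper names `fits_in_16_bits`, `utf16_units` and `utf8_bytes` are made up for illustration). It shows that a code point above U+FFFF needs two 16-bit units in UTF-16 and four bytes in UTF-8, so it cannot be held in a single 16-bit character.

```python
# Largest Unicode code point; it needs 21 bits.
MAX_CODE_POINT = 0x10FFFF


def fits_in_16_bits(ch: str) -> bool:
    """True if the character's code point fits in one 16-bit unit."""
    return ord(ch) <= 0xFFFF


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to store ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def utf8_bytes(ch: str) -> int:
    """Number of bytes needed to store ch in UTF-8."""
    return len(ch.encode("utf-8"))


print("bits for largest code point:", MAX_CODE_POINT.bit_length())
# Latin A, Cyrillic Ya, euro sign, and a musical symbol outside the 16-bit range
for ch in ["A", "\u042F", "\u20AC", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", fits_in_16_bits(ch), utf16_units(ch), utf8_bytes(ch))
```

The last character in the loop is the kind of input a 16-bit text model cannot represent as one character.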
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Doug is right; we should rename the issue to "Adding 16-bit Unicode support for Component Pascal identifiers".
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Josef Templ »

A list of changes can be found here:

http://redmine.blackboxframework.org/pr ... 83e3294a69

Have fun! It is a long list.
Let me know if there are any files missing.

- Josef
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Josef, I updated Script/Mod/DevCPT.odc and the build pipeline has just built a new BlackBox:

blackbox-1.7-a1.020.zip
blackbox-1.7-a1.020-setup.exe

The differences are listed here.

I downloaded it and tried some examples with Cyrillic. Everything seems to work fine!
(Attachment: cyrillic.png)
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

It appears that the TYPE of a character is effectively changed from SHORTCHAR to CHAR.
That is not what I expected. I had expected that a new TYPE Utf8 would be used, where Utf8 is an array of SHORTCHAR. The array length of a variable of type Utf8 would then vary between 1, 2, and 3. Most of the time it would use a single byte. Perhaps there would be three predefined types Utf8_1, Utf8_2, Utf8_3 so that the overhead of a pointer to Utf8 would be avoided.

But then again, as Josef has said, this code is only used for the programming aspects of BlackBox, and 16 bits is sufficient in that case.
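
As a rough illustration of the "1, 2, or 3" lengths mentioned above, the following Python sketch reports the UTF-8 length of a 16-bit code point. The function name `utf8_length` is hypothetical and this is not how the BlackBox compiler is written; it only demonstrates where the length boundaries fall.

```python
def utf8_length(code_point: int) -> int:
    """UTF-8 length in bytes for a 16-bit code point (surrogates excluded)."""
    if not 0 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("expected a non-surrogate 16-bit code point")
    return len(chr(code_point).encode("utf-8"))


# Print the length on both sides of each boundary.
for cp in (0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF):
    print(f"U+{cp:04X} -> {utf8_length(cp)} byte(s)")
```

ASCII stays at one byte, Cyrillic letters fall in the two-byte range, and everything from U+0800 upward takes three bytes.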
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Josef Templ »

Doug, in a BlackBox 1.6 text document the characters are encoded using 16-bit Unicode.
With the issue-#19 changes, the CP compiler reads the Unicode characters, and when it builds up
an identifier, it uses the UTF-8 encoding internally to represent the name of that identifier.
This is compact for identifiers that consist mostly of ASCII characters
and identical to the 1.6 version for identifiers that consist only of ASCII characters.
The UTF-8 format is then also used, without any further conversion,
for externalizing data to the symbol or object file.
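
The property described here can be sketched in Python. This is only an illustration under assumptions: `internal_name` is a hypothetical stand-in for whatever the compiler does internally, and the two identifiers are made-up examples.

```python
def internal_name(identifier: str) -> bytes:
    """Hypothetical stand-in: represent an identifier's name as UTF-8 bytes."""
    return identifier.encode("utf-8")


ascii_name = "StdLog"                                # ASCII-only identifier
cyrillic_name = "\u041F\u0440\u0438\u0432\u0435\u0442"  # six Cyrillic letters

# For ASCII-only names the UTF-8 bytes equal the plain one-byte-per-character form.
print(internal_name(ascii_name) == ascii_name.encode("ascii"))

# Non-ASCII names grow, but decoding recovers the original name exactly.
print(len(internal_name(cyrillic_name)))
print(internal_name(cyrillic_name).decode("utf-8") == cyrillic_name)
```

Because the ASCII case is byte-for-byte unchanged, data written for ASCII-only identifiers would look the same before and after such a change.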

Thanks to Ivan for replicating the DevCPM changes to the scripting engine.
This shows that the redundancy in the Script subsystem (slightly modified copies of
DevCPM and HostFiles) is a potential source of inconsistencies and should be
eliminated sooner rather than later.

- Josef
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Summary(?)
(1.6) 2-byte text input => 1-byte output
(1.7) 2-byte text input => Utf8 (variable number of bytes) output

When 2 byte text is ASCII then the two (1.6, 1.7) output forms are identical.

I still have a problem. If the high bit {15} of the input 16 bit Unicode is set then the resultant Utf8 will be 3 bytes.
That does not correspond to Helmut's comments that he uses only 16 bits.

Does that mean that input 16 bit Unicode is actually restricted to 15 bits?

I am assuming that the Utf8 rule states that the high bit of a byte is a 'continuation' bit such that if set then another byte follows.
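
For reference, UTF-8 does not use the high bit as a simple "another byte follows" flag. The leading bits of the first byte give the total length (0xxxxxxx for one byte, 110xxxxx for two, 1110xxxx for three), and every following byte has the form 10xxxxxx. All 16 input bits remain usable; three bytes are needed from U+0800 upward, not only when bit 15 is set. The Python sketch below hand-encodes 16-bit code points to show the bit layout. The function name `encode_utf8_bmp` is made up for illustration and this is not BlackBox code.

```python
def encode_utf8_bmp(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoder for 16-bit code points (illustration only)."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")
    if cp < 0x80:
        # 0xxxxxxx: 7 payload bits
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 = 11 payload bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 payload bits
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("code point does not fit in 16 bits")


# Bit 15 is clear for U+0800 and set for U+8000; both take three bytes.
for cp in (0x0041, 0x042F, 0x0800, 0x8000):
    print(f"U+{cp:04X} ->", encode_utf8_bmp(cp).hex(" "))
```

The three-byte form carries exactly 16 payload bits, which is why the full 16-bit range fits without restriction.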