issue-#19: Unicode for Component Pascal identifiers

Josef Templ · Post by **Josef Templ** » Thu Oct 23, 2014 10:42 am

I have created redmine issue-#19 for "Adding full Unicode support for Component Pascal identifiers"
and a topic branch. See http://redmine.blackboxframework.org/issues/19
and https://github.com/BlackBoxCenter/black ... ssue-%2319.

Discussion of this issue has been started under http://forum.blackboxframework.org/view ... f=41&t=109
using a different approach, which is now obsolete. Discussion should be continued in this thread.

I will try to collect the related changes by next week.

- Josef

DGDanforth · Post by **DGDanforth** » Fri Oct 24, 2014 6:40 am

From the CP language report

"Every identifier occurring in a program must be introduced by a declaration, unless it is a predeclared identifier. Declarations also specify certain permanent properties of an object, such as whether it is a constant, a type, a variable, or a procedure. The identifier is then used to refer to the associated object."

(emphasis added)
I write that simply to make sure I understand where Unicode will be used.

The report also says "Unicode (16 bit) characters are allowed in string constants only."

I now ask "What about comments"? Surely Russian speakers would want to be able to comment
in Russian, or is that already handled by the compiler by simply ignoring comments?

Ivan Denisov · Post by **Ivan Denisov** » Fri Oct 24, 2014 7:15 am

DGDanforth wrote:I now ask "What about comments"? Surely Russian speakers would want to be able to comment
in Russian, or is that already handled by the compiler by simply ignoring comments?

Yes, it is handled already. No problems with comments in Unicode.

DGDanforth · Post by **DGDanforth** » Mon Oct 27, 2014 10:00 pm

Josef,
There seems to be a conflict between "Adding full Unicode support for Component Pascal identifiers" and the editor's limitation to 16 bits. It takes 21 bits to encompass the current Unicode range (U+0000 to U+10FFFF).

Internally, I understand how SHORTCHAR strings can be used for UTF-8 and hence for all of Unicode. But how, with the editor, do you get such strings into a document that only supports 16-bit characters?

Ivan Denisov · Post by **Ivan Denisov** » Tue Oct 28, 2014 3:58 am

Doug is right; we should rename the issue to "Adding 16-bit Unicode support for Component Pascal identifiers".

Josef Templ · Post by **Josef Templ** » Tue Oct 28, 2014 9:14 am

A list of changes can be found here:

http://redmine.blackboxframework.org/pr ... 83e3294a69

Have fun! It is a long list.
Let me know if there are any files missing.

- Josef

Ivan Denisov · Post by **Ivan Denisov** » Tue Oct 28, 2014 4:56 pm

Josef, I updated Script/Mod/DevCPT.odc, and the build pipeline has just built a new BlackBox:

blackbox-1.7-a1.020.zip
blackbox-1.7-a1.020-setup.exe

The differences are listed here.

I downloaded it and tried some example with Cyrillic. All seems to work fine!

Attachment: cyrillic.png (20.93 KiB)

DGDanforth · Post by **DGDanforth** » Tue Oct 28, 2014 6:25 pm

It appears that the TYPE of a character is effectively changed from SHORTCHAR to CHAR.
That is not what I expected. I had expected that a new TYPE Utf8 would be used where Utf8 is an array of SHORTCHAR. The array length of a variable of Utf8 would then be variable in the range 1,2, or 3. Most of the time it would use a single byte. Perhaps there would be 3 predefined types Utf8_1, Utf8_2, Utf8_3 so that the overhead of a pointer to Utf8 would be avoided.

But then again as Josef has said this code is only used for the programming aspects of BlackBox and 16 bits is sufficient in that case.

Josef Templ · Post by **Josef Templ** » Wed Oct 29, 2014 3:09 am

Doug, in a BlackBox 1.6 text document the characters are encoded using 16-bit Unicode.
With the issue-#19 changes, the CP compiler reads the Unicode text, and when it builds up
an identifier, it uses the UTF-8 encoding internally to represent the name of that identifier.
This is compact for identifiers that consist mostly of ASCII characters
and identical to the 1.6 version for identifiers that consist only of ASCII characters.
The UTF-8 format is then also used, without any further conversion,
for externalizing data to the symbol or object file.
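
The effect described above can be illustrated outside BlackBox. The following is a minimal Python sketch, not the compiler's actual code: the function name `identifier_to_utf8` and the sample identifiers are hypothetical, and it assumes that every character of an identifier is a plain 16-bit value (no surrogates).

```python
def identifier_to_utf8(name: str) -> bytes:
    """Encode an identifier made of 16-bit characters as UTF-8."""
    for ch in name:
        cp = ord(ch)
        # Assumption: the source text stores plain 16-bit characters,
        # so anything above U+FFFF or in the surrogate range is rejected.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a 16-bit character: U+{cp:04X}")
    return name.encode("utf-8")


ascii_name = "Counter"    # hypothetical ASCII-only identifier
cyrillic_name = "Привет"  # hypothetical Cyrillic identifier

# ASCII-only identifiers: one byte per character, same bytes as plain ASCII.
assert identifier_to_utf8(ascii_name) == ascii_name.encode("ascii")

# Cyrillic letters lie in U+0400..U+04FF and take two bytes each.
assert len(identifier_to_utf8(cyrillic_name)) == 2 * len(cyrillic_name)
```

The first assertion corresponds to the statement that ASCII-only identifiers are stored exactly as in 1.6; the second shows the modest size increase for a Cyrillic name.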

Thanks to Ivan for replicating the DevCPM changes to the scripting engine.
This shows that the redundancy in the subsystem Script (slightly modified copies of
DevCPM and HostFiles) is a potential source of inconsistencies and should be
eliminated sooner rather than later.

- Josef

DGDanforth · Post by **DGDanforth** » Thu Oct 30, 2014 9:29 pm

Summary(?)
(1.6) 2 byte text input =>1 byte output
(1.7) 2 byte text input =>Utf8 (variable bytes) output

When 2 byte text is ASCII then the two (1.6, 1.7) output forms are identical.

I still have a problem. If the high bit {15} of the input 16 bit Unicode is set then the resultant Utf8 will be 3 bytes.
That does not correspond to Helmut's comments that he uses only 16 bits.

Does that mean that input 16 bit Unicode is actually restricted to 15 bits?

I am assuming that the Utf8 rule states that the high bit of a byte is a 'continuation' bit such that if set then another byte follows.
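
For reference, in standard UTF-8 the byte count depends on the magnitude of the code point rather than on bit 15 alone. The leading bits of the first byte announce the length of the sequence, and every continuation byte starts with the bits `10`. A minimal Python sketch for 16-bit values follows; `utf8_encode_bmp` is a hypothetical helper written only for illustration, and surrogate values are not treated specially.

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode one 16-bit code point (0..0xFFFF) as UTF-8 by hand."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    if cp < 0x80:
        # Up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # Up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Print the encoded length for values on both sides of each threshold.
for cp in (0x41, 0x416, 0x7FF, 0x800, 0x7FFF, 0x8000, 0xFFFF):
    print(f"U+{cp:04X} -> {len(utf8_encode_bmp(cp))} byte(s)")
```

Under this scheme all 16 bits of the input remain usable: every value from U+0800 through U+FFFF takes three bytes, whether or not bit 15 is set.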
