issue-#19: Unicode for Component Pascal identifiers

Josef Templ · Post by **Josef Templ** » Thu Nov 20, 2014 9:04 am

BB 1.6 uses the Utf8 format to support 16-bit Unicode characters, but within string constants only.
The idea is the same as in Helmut's solution. I don't know whether Helmut was aware of that when he
introduced Utf8 in the compiler, but he probably was.

The main advantage of using Utf8 in symbol and object files is that it gives us a
compact encoding of the most important case, the ASCII characters.
They need only a single byte, as they did in BB1.6. So as long as you use plain ASCII identifiers,
the symbol and object files don't get larger. They are even unchanged.
With 2-byte CHAR instead, every identifier would need twice the space it did before.
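
The size argument can be checked with a short sketch. It is written in Python purely for illustration, uses UTF-16 (little-endian, no byte order mark) as a stand-in for a 2-byte CHAR encoding, and the non-ASCII identifiers are made-up examples rather than names from BlackBox.

```python
def encoded_sizes(identifier):
    """Return (UTF-8 size, 2-bytes-per-CHAR size) in bytes for an identifier."""
    utf8_size = len(identifier.encode("utf-8"))
    # utf-16-le writes no byte order mark, so each BMP character takes 2 bytes.
    two_byte_size = len(identifier.encode("utf-16-le"))
    return utf8_size, two_byte_size


# "StdLoader" is plain ASCII; the other two names are hypothetical.
for name in ("StdLoader", "Größe", "Имя"):
    print(name, encoded_sizes(name))
```

For the ASCII name the Utf8 size equals the character count, while the 2-byte form doubles it. Only names with non-ASCII characters grow under Utf8.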

The compiler uses Utf8 also internally (in memory) because that avoids the conversion of all imported identifiers
when reading the symbol files. In addition it seemed simpler to migrate from BB1.6 that way.
Also the runtime system (kernel, loader, etc) uses Utf8 internally. As with the compiler this avoids some conversions
and seemed simpler to migrate from BB1.6.

- Josef

Ivan Denisov · Post by **Ivan Denisov** » Thu Nov 20, 2014 9:24 am

LuoWy sent me a new, updated version yesterday. As I understand it, he has made it faster.
It is now only about 20% slower than Josef's version with difficult strings and about 10% slower with ASCII strings.
However, it performs all the checks! Why can't we use it? And then simply export it to Strings?

Josef Templ version:
68.5 ms
12.3 ms
Incorrect input 1:  $TRUE
Incorrect input 2:  $FALSE
Truncated:  $TRUE

Alexandr Shiryaev version:
79.8 ms
15.7 ms
Incorrect input 1:  $TRUE
Incorrect input 2:  $TRUE
Truncated:  $TRUE

1st LuoWy version:
89.3 ms
15.2 ms
Incorrect input 1:  $TRUE
Incorrect input 2:  $TRUE
Truncated:  $TRUE

2nd LuoWy version:
81.5 ms
13.8 ms
Incorrect input 1:  $TRUE
Incorrect input 2:  $TRUE
Truncated:  $TRUE

LuoWy's new version is attached here, inside the test module Utf8ToStringLW2.
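
The relative slowdowns can be recomputed from the timings listed in this post. The sketch below is in Python for illustration only. It assumes that the first timing of each pair is the run with difficult strings and the second is the run with ASCII strings, which the output itself does not label.

```python
# Timings in ms, copied from the benchmark output in this post.
# Assumption: (difficult strings, ASCII strings) for each version.
timings = {
    "Josef Templ": (68.5, 12.3),
    "Alexandr Shiryaev": (79.8, 15.7),
    "1st LuoWy": (89.3, 15.2),
    "2nd LuoWy": (81.5, 13.8),
}


def slowdown_percent(candidate_ms, baseline_ms):
    """How much slower the candidate is than the baseline, in percent."""
    return (candidate_ms / baseline_ms - 1.0) * 100.0


base_difficult, base_ascii = timings["Josef Templ"]
for name, (difficult, ascii_only) in timings.items():
    print(
        f"{name}: {slowdown_percent(difficult, base_difficult):.1f}% (difficult), "
        f"{slowdown_percent(ascii_only, base_ascii):.1f}% (ASCII)"
    )
```

For the second LuoWy version this gives roughly 19% and 12%, in line with the figures quoted in the post.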

Josef Templ · Post by **Josef Templ** » Thu Nov 20, 2014 9:47 am

> However, it performs all the checks! Why can't we use it? And then simply export it to Strings?

We cannot use it because its semantics does not fit with the definition of a string in Component Pascal.
Changing that definition would be a separate issue.

- Josef

Josef Templ · Post by **Josef Templ** » Thu Nov 20, 2014 9:50 am

The current version of StdLoader.ThisMod is questionable.
When looking into it in more detail I discovered that it actually should return an error
code instead of generating a TRAP when a Utf8 conversion error is detected.
The appropriate result code would be "syntaxError".

Any comments on that?

- Josef

DGDanforth · Post by **DGDanforth** » Thu Nov 20, 2014 10:48 am

Seems reasonable. Rather than

Kernel.Utf8ToString(name, n, res); ASSERT(res = 0);

Zinn · Post by **Zinn** » Fri Nov 21, 2014 6:45 pm

Josef Templ wrote:
> The current version of StdLoader.ThisMod is questionable.
> When looking into it in more detail I discovered that it actually should return an error
> code instead of generating a TRAP when a Utf8 conversion error is detected.
> The appropriate result code would be "syntaxError".

Yes, we should return an error code.
First I have to find out which error codes are already used here.
Should the error code from the Utf8 conversion be distinct from the other error codes?
Or should we insert IF res # 0 THEN res := syntaxError END; ?

Ivan Denisov · Post by **Ivan Denisov** » Sun Nov 23, 2014 4:30 am

Josef Templ wrote:
>> However, it performs all the checks! Why can't we use it? And then simply export it to Strings?
>
> We cannot use it because its semantics does not fit with the definition of a string in Component Pascal.
> Changing that definition would be a separate issue.
>
> - Josef

No, Josef, I think you are making a mistake here. All of Unicode is covered by this well-formed UTF8, so there is no violation of the Component Pascal string definition. If forbidden sequences appear, it is either a hacker's work or a damaged file. I wrote the Test5 procedure to demonstrate this.

Attachment: Utf8Test.txt (9.1 KiB)

Sequences are forbidden if the character could have been encoded in a shorter form. For example, here is part of this forbidden region:
0E0X 080X-09FX 080X-0BFX

The first column is the forbidden encoding, the second the character, and the third the normal encoding.

...
 224 129 161 = a =  97
 224 129 162 = b =  98
 224 129 163 = c =  99
 224 129 164 = d =  100
 224 129 165 = e =  101
 224 129 166 = f =  102
 224 129 167 = g =  103
 224 129 168 = h =  104
 224 129 169 = i =  105
 224 129 170 = j =  106
 224 129 171 = k =  107
 224 129 172 = l =  108
 224 129 173 = m =  109
 224 129 174 = n =  110
 224 129 175 = o =  111
 224 129 176 = p =  112
 224 129 177 = q =  113
 224 129 178 = r =  114
 224 129 179 = s =  115
 224 129 180 = t =  116
 224 129 181 = u =  117
 224 129 182 = v =  118
 224 129 183 = w =  119
 224 129 184 = x =  120
 224 129 185 = y =  121
 224 129 186 = z =  122
...
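
The first row of the table can be reproduced with a short sketch, written in Python for illustration. `naive_decode3` is a hypothetical helper that applies only the 3-byte bit arithmetic with no range check. It is not the code of any version discussed in this thread. Python's built-in strict UTF-8 codec stands in for a decoder that enforces the Unicode well-formedness rules.

```python
def naive_decode3(b0, b1, b2):
    """Decode a 3-byte UTF-8 pattern by bit arithmetic only (no overlong check)."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def strict_accepts(data):
    """Return True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_a = bytes([224, 129, 161])       # first row of the table
print(chr(naive_decode3(*overlong_a)))    # prints: a
print(strict_accepts(overlong_a))         # prints: False (overlong form rejected)
print(strict_accepts(bytes([97])))        # prints: True (normal 1-byte encoding)
```

A decoder without the range check maps both byte sequences to the same character, which is why the longer form is treated as ill-formed.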

Ivan Denisov · Post by **Ivan Denisov** » Sun Nov 23, 2014 4:43 am

I think it is now clear that this is not a "content check" but a Unicode standard format check. We should choose between:

- the simpler, faster wiki format check (Josef Templ's version),
- the secure Unicode 7.0 format check (WenYing Luo's version).

Please, let's vote on this.

Josef Templ · Post by **Josef Templ** » Sun Nov 23, 2014 7:26 am

Zinn wrote:
>> Josef Templ wrote:
>> The current version of StdLoader.ThisMod is questionable.
>> When looking into it in more detail I discovered that it actually should return an error
>> code instead of generating a TRAP when a Utf8 conversion error is detected.
>> The appropriate result code would be "syntaxError".
>
> Yes, we should return an error code.
> First I have to find out which error codes are already used here.
> Should the error code from the Utf8 conversion be distinct from the other error codes?
> Or should we insert IF res # 0 THEN res := syntaxError END; ?

Unfortunately, syntaxError is not the right error code.
It refers to the format of the code file itself, not to problems with converting the file name.
Either we define a new one or we leave it with the ASSERT.
Note: the ASSERT will never be hit anyway, because the real check happens at a different place,
in Kernel.ThisMod, where the StringToUtf8 conversion is done. And there is an ASSERT there as well.

In general, returning an error code instead of a TRAP is much more complicated than I expected.
So I would refrain from doing it now.

- Josef

Josef Templ · Post by **Josef Templ** » Mon Nov 24, 2014 9:32 am

> No, Josef, I think you are making the mistake here. All Unicode is covered by this well-formed UTF8. So there are no violation of Component Pascal string definition. If forbidden sequences appears - this is hacker work or file damaged. I made Test5 procedure for demonstrate this.

Ivan, there is not only Utf8ToString but also StringToUtf8.
They should be symmetric in their behavior, i.e. a string encoded by
StringToUtf8 should be decoded by Utf8ToString.

- Josef
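
One way such an asymmetry could arise is sketched below in Python. This is an interpretation, not something stated in the thread. It assumes that a Component Pascal string may hold any 16-bit CHAR value, including an unpaired surrogate such as 0D800X, and that the encoder writes it out as an ordinary 3-byte sequence. The three functions are hypothetical stand-ins, not the BlackBox procedures.

```python
def string_to_utf8(s):
    """Stand-in encoder (assumption): encodes every 16-bit value, surrogates included."""
    return s.encode("utf-8", "surrogatepass")


def utf8_to_string_lenient(data):
    """Stand-in decoder that accepts whatever the encoder above produces."""
    return data.decode("utf-8", "surrogatepass")


def utf8_to_string_strict(data):
    """Stand-in decoder that enforces Unicode well-formedness."""
    return data.decode("utf-8")  # raises UnicodeDecodeError on surrogates


text = "x\ud800y"                      # string with a lone surrogate code unit
encoded = string_to_utf8(text)
print(encoded)                         # prints: b'x\xed\xa0\x80y'

# The lenient pair is symmetric: decoding returns the original string.
assert utf8_to_string_lenient(encoded) == text

# The strict decoder rejects what the encoder produced, so the pair is not symmetric.
try:
    utf8_to_string_strict(encoded)
    symmetric = True
except UnicodeDecodeError:
    symmetric = False
print(symmetric)                       # prints: False
```

Under these assumptions, a stricter decoder would also require a matching restriction on the encoder, or on what counts as a valid string.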
