Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Mon Nov 17, 2014 3:26 am
by Ivan Denisov
The solution from luowy.
luowy wrote:Hi Ivan,

I have written a procedure following your proposal (PEP 383). As I can't post replies on bbcenter's forum, I am sending it to you; maybe it will help with issue #19.


StdCoder.Decode ..,, ..fv....3QwdONl9RhOO9vRbf9b8R7fJHPNGomCrlAyIhgs,CbKBhZ
 --- end of encoding --- 
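
For readers unfamiliar with PEP 383: it defines Python's `surrogateescape` error handler, which maps each undecodable byte to a lone surrogate in the range U+DC80..U+DCFF so that the original bytes can be restored on encoding. The sketch below only illustrates that idea using Python's standard library; whether the encoded procedure above works the same way cannot be seen from the post, and the sample bytes are an assumption chosen for illustration.

```python
# Latin-1 encoded "café": the trailing byte 0xE9 is not valid UTF-8.
raw = b"caf\xe9"

# surrogateescape (PEP 383) turns the offending byte into U+DC00 + byte value.
text = raw.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in text])  # last element is 0xdce9

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True
```

The price of this round trip is that the decoded string may contain lone surrogates, which is exactly the kind of "invalid" character content discussed later in this thread.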

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Mon Nov 17, 2014 6:24 am
by DGDanforth
Ivan Denisov wrote:Zinn and Josef, please look at the table 3-7 here:
Notice that the exceptions to the rule only occur if there are 3 or 4 bytes.
If we restrict our support to two bytes (16 bits) then almost all of the world's languages
are supported and the conversions become trivial.

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Mon Nov 17, 2014 8:40 am
by Ivan Denisov
DGDanforth wrote:Notice that the exceptions to the rule only occur if there are 3 or 4 bytes.
If we restrict our support to two bytes (16 bits) then almost all of the world's languages
are supported and the conversions become trivial.
No, because the number of bytes in a UTF-8 character does not match the number of bytes in UCS-2 (2-byte Unicode).
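
A small Python sketch makes the mismatch concrete. The helper name `byte_lengths` is hypothetical; `utf-16-be` is used as a stand-in for UCS-2, which is equivalent for single characters of the Basic Multilingual Plane outside the surrogate range.

```python
def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 length, UCS-2 length) in bytes for one BMP character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# A (U+0041), e-acute (U+00E9), Cyrillic Zhe (U+0416), euro sign (U+20AC)
for ch in ["A", "\u00e9", "\u0416", "\u20ac"]:
    utf8_len, ucs2_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UCS-2 {ucs2_len} bytes")
```

Every one of these characters fits in 16 bits, yet the UTF-8 form takes 1, 2, 2 and 3 bytes respectively. Limiting UTF-8 to 2-byte sequences would only reach U+07FF, so covering all 16-bit characters still requires 3-byte sequences.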

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Mon Nov 17, 2014 8:50 am
by DGDanforth
Ivan Denisov wrote:
DGDanforth wrote:Notice that the exceptions to the rule only occur if there are 3 or 4 bytes.
If we restrict our support to two bytes (16 bits) then almost all of the world's languages
are supported and the conversions become trivial.
No, because the number of bytes in a UTF-8 character does not match the number of bytes in UCS-2 (2-byte Unicode).
So why are we using utf-8? Why aren't we using 2-byte Unicode?

I never did understand why Helmut did that.

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Mon Nov 17, 2014 10:24 am
by Josef Templ
Once again: a VALID String in Component Pascal is one that ends with 0X with any number and value of
16-bit Unicode characters preceding the 0X. There is NO NOTION of invalid Unicode characters in Component Pascal.
Why in the world do we need to introduce invalid Unicode characters now?
There was no problem in the past and there will be no problem in the future that is solved by
complicating the Utf8 conversion.

If it turns out to be a problem in the future we have to introduce the notion of invalid Unicode characters
e.g. in module Strings by introducing another character class. But this issue is completely
independent from our current issue of introducing 16-bit Unicode support for CP identifiers.

The Utf8-conversion in Strings is for users of Component Pascal, not for all users of UTF-8 in the world.
Those other users may use C or C++, which suffers from undetected buffer overflows etc.
Component Pascal does not have this problem.

If we want to vote for it, I see three options:
1. no checks at all as proposed by Helmut
2. format checks according to the format as defined in Wikipedia, proposed by me
3. format checks plus content checks as proposed by Ivan

my comments on the choices:
(1) is too optimistic; there may be situations where due to an error or inconsistency
a program tries to decode a Utf8 string which has not been encoded before, for example because it has been
written to a file by BlackBox 1.6. This must be detected. Not checking the Utf-8 format is like not
checking the format for string-to-integer conversion or not checking the CP syntax in the compiler.

(2) is my choice. It is simple and almost as efficient as (1). For ASCII characters there is no difference at all.

(3) is way too complicated and deviates from the definition of a string in CP.
It is asymmetric in its behavior for encoding and decoding and it even slows down conversion
of ASCII characters, at least when the algorithm proposed by Ivan is used.
This is not the BlackBox style of doing it.

- Josef
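
The difference between options (2) and (3) can be sketched in a few lines. This is an illustrative Python sketch, not any of the BlackBox versions discussed in this thread: the name `utf8_to_chars` and the `check_content` flag are hypothetical, and what counts as a "content check" (here: rejecting overlong encodings and surrogate code points) is an assumption. It accepts only 1- to 3-byte sequences, which is all that 16-bit characters need.

```python
def utf8_to_chars(data: bytes, check_content: bool = False) -> list[int]:
    """Decode UTF-8 into 16-bit character codes.

    Format checks (always on): valid lead byte, enough bytes left,
    every continuation byte has the form 10xxxxxx.
    Content checks (optional, assumed meaning): no overlong encodings,
    no surrogate code points U+D800..U+DFFF.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                    # 0xxxxxxx: ASCII, no extra work
            code, length = b0, 1
        elif 0xC0 <= b0 < 0xE0:          # 110xxxxx 10xxxxxx
            code, length = b0 & 0x1F, 2
        elif 0xE0 <= b0 < 0xF0:          # 1110xxxx 10xxxxxx 10xxxxxx
            code, length = b0 & 0x0F, 3
        else:                            # stray continuation or 4-byte lead
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, length):
            b = data[i + k]
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            code = (code << 6) | (b & 0x3F)
        if check_content:
            if code < (0, 0x80, 0x800)[length - 1]:
                raise ValueError(f"overlong encoding at offset {i}")
            if 0xD800 <= code <= 0xDFFF:
                raise ValueError(f"surrogate code point at offset {i}")
        out.append(code)
        i += length
    return out


# Well-formed input decodes identically in both modes.
print(utf8_to_chars("A\u00e9\u20ac".encode("utf-8")))

# Format checks alone accept an encoded surrogate; content checks reject it.
print(utf8_to_chars(b"\xed\xa0\x80"))
try:
    utf8_to_chars(b"\xed\xa0\x80", check_content=True)
except ValueError as err:
    print("rejected:", err)
```

In this sketch a byte string that was never UTF-8 encoded, such as the single Latin-1 byte `b"\xe9"`, is already caught by the format checks; the content checks only add work for multi-byte sequences. Whether that matches the cost of the algorithms actually proposed in the thread is not something the sketch can show.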

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Mon Nov 17, 2014 11:48 am
by Ivan Denisov
The vote should be on a simpler question: is this procedure "for export" or "for internal use"?

1. The kernel function Utf8ToString should be written "for export", on the expectation that it will be used in unforeseen tasks connecting BlackBox with the UTF-8 world, including library bindings and arbitrary input sequences. It should be correct according to the latest Unicode standard. (Alexander's or, better, LuoWy's version)

2. The kernel function Utf8ToString should be written "for internal use" and should be renamed to AdoptStringFromSymbolFile or something like this. It should be simple and maximize efficiency. (Helmut's or, better, Josef's version)

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Mon Nov 17, 2014 10:53 pm
by DGDanforth
Ivan Denisov wrote: 2. The kernel function Utf8ToString should be written "for internal use" and should be renamed to AdoptStringFromSymbolFile or something like this. It should be simple and maximize efficiency. (Helmut's or, better, Josef's version)

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Tue Nov 18, 2014 7:52 am
by Josef Templ
Ivan Denisov wrote: The vote should be on a simpler question: is this procedure "for export" or "for internal use"?
This is a much more complicated question because it shifts the focus from the technical aspect (how it is done)
to the usage aspect (how it may be used). In addition, the proposed kinds of usage don't make much sense.
A procedure exported (by means of an export marker *) is exported no matter what any vote decides.
I strongly propose to stay on the technical side.
The vote should be about which kinds of checks are performed:
no checks, format checks, or content checks.

> UnicodeToString
> StringToUnicode

Doug, this is the wrong naming for sure because it hides the fact that it is about conversion from/to Utf-8 format.

A note to Ivan:

If you assign a character code to a CHAR variable in Component Pascal (ch := 0yyyX;), there is no limitation regarding the
possible values of the assigned character code. As long as there is no such limitation, there is no value in
checking the contents of a Utf-8 string. You can always introduce illegal characters into a string
by means of an assignment of character codes, by means of reading in a two-byte Unicode character from a file, from the clipboard, etc.
Checking the contents of a CHAR or string is an independent issue that is much broader than
doing it only in the Utf8 conversion. If there is any need for doing such checks, it can be discussed in a separate issue.
Now we are blocking issue-#19 by mixing it up with a different issue.
Also there is a change in the README file committed by Ivan. This change is not related in any way to issue-#19.
Ivan, it seems that you have not understood the concept of a topic branch. The changes done for a topic branch should
all be related to that topic. That's why it is called a 'topic branch'. For somebody not experienced in
software engineering techniques this does not make a big difference, however, in the long term it
is a must in order not to get a complete mess in the repository and its history.

- Josef

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Tue Nov 18, 2014 8:21 pm
by Zinn
DGDanforth wrote:So why are we using utf-8? Why aren't we using 2-byte Unicode?
I never did understand why Helmut did that.
Please read again the complete blocks
- Feature #9: adding module Characters
- Issue #19: Unicode for Component Pascal identifiers
from beginning to end, obeying the following rules:
1. Skip your own comments
2. Read Josef’s explanations twice
3. Read all other entries once

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Tue Nov 18, 2014 8:32 pm
by Zinn
Josef Templ wrote: If we want to vote for it, I see three options:
1. no checks at all as proposed by Helmut
2. format checks according to the format as defined in Wikipedia, proposed by me
3. format checks plus content checks as proposed by Ivan

(2) is my choice. It is simple and almost as efficient as (1). For ASCII characters there is no difference at all.
I see only point (2) as the right solution. I have the same opinion as Josef.
The last published version of CPC Edition 1.7-RC4 Built 15 from 11.11.2014
uses Josef’s solution for Utf8ToString conversion.
It is the best solution.