issue-#19: Unicode for Component Pascal identifiers

Merged to the master branch
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

The solution from luowy.
luowy wrote:Hi Ivan,

I have written a procedure along the lines of your proposal (PEP 383). Since I can't post replies on bbcenter's forum, I am sending it to you; maybe it will help with issue #19.

Regards,
luowy


StdCoder.Decode ..,, ..fv....3QwdONl9RhOO9vRbf9b8R7fJHPNGomCrlAyIhgs,CbKBhZ
 xi2,CoruKu4qouqm8rtuGfa4.hOO9vRb1Y66wb8RTfQ9vQRtIdvPZHWKqtCa.E.U5Usp,6.5Qw
 dONlnayKmKKqCLLCJuGqayKm6F9vQ5nsH3.bnayKmKa2,Cor.kay4.qorGqmQCU2,CJuyKtQC9
 8P9PP7ONbXmb.2.AdAk5kUm.,6.k39.86.QC18RdfQHfMf9R9vQ7ONb1E.kHE.0.p.,6.jdLL3
 0EJYjyC.6.VQ.E4k.8Mtf.2.S02.e,2UgW.Ue.E.mP,UAU0IkmL,6.Y32.I16.j,6.J,U.YLk.
 0.85CE,9T3E.0.n00.p.0U.460.J,U.2GE4E.q,CE3U2V1w,61s.VU.64s.T.S.8E0E08Mtf.2
 .y20E.c4E.2E2.e0U.2Uw0e.8EOE.a78k8E.a,8k.E.U1o.2U5U3IkmL,6..EBU.YJ2.I3.,6.
 V2g0MR1U1A20k0u0I,QU,U.A2I.6.FR.QUDU.21gUdU1Y,MD6.1U.QUF2.0k7k0e,0kIE,O,2,
 ,E,4.0k7k,C,0E1k044Ck0E0G.CE,U8U.Y2QUZU.g2gUPU1Y0gUX,9.7.CE,k0a,0EJ2.5s16.
 d0zT1H6IZuH5OF7OJZOF,NJdfNl7JTvIdfQHfPDf8,78HeH,NRdfNldC,NEZeI1OK,tHB86b8G
 TeIduEFOEZuC,tHf8J,tQdfQp761eI.CIY42UmhgnJbUAdCZe3xc3JedQbBAV7QcDpdHZeUAhg
 ZhZxgVZh0BjohgUgbUAav2YoJipphXBgohgY3Yx2Yl2av2Ze2YmhgnhigZiUIZdgV7AV1,Oqo8
 rtGLEqHE0nR0Gu4qomKEqHE4nRWGJ0mtGrkGrmemIqk4ak2OpU8JEWLK0momGEeKK0mq4KweHE
 aIb.rN1HM0HsMFfC,tIF0UBUnZZUQimIbUAdC,2YcIZUQC66JN8PU7Yiu2Y7,.Grka43PSdPNb
 96JN8P.TvON76bPRZfQp763uHT8H9OERuCH66Fd8,tQffQyqn4KuKKE.qk2aEfEIeGcKIcCHQC
 HJam4aU7Igppgu2Y,,CHEyIX0md..ohg2YhJbUAdC,g,g,3OFDOGR86ZPN0GRqHE0nRqk2ako0
 GRqHE66J96pND,,PPMl96pND,7H9eHFtQdfQH76P76XtC,tQ,dCvFnaKtsC,N1HM0j8GH8H986
 FNRdfNipoqJECGE0HgaGEOGEWGp0GS0mq4ad2Y2xdUgV7M05HEenSgiopAsCPM0akYOIECLEqH
 EO42YI3d3pdUgV7A,HcMQfkgfUIbxsMFvC,dP,dCvlMi1Z76pND01bPRZHEenSoc,ZdHhcv2YU
 gV7A,HsE1uI98659O,tHB86PM0AVw3Yl2fcIZk2feAZioZrocMJbUQioJiPJhR3Yug55nRAdCR
 ccIhdQbBU7YDVtEZ7KRd9V7FB8Kp76l96pNDyIdGIICKoaGEqGEAbmQbUQiUUoBgdtC,7R,dCv
 lMgV7k2m598Ale9R7A9eFleC,N1HU76S,dCw7.IamYav2YBU7MGQgc3Yx2Ykgck66d8GsQZ76M
 AU7MFNuIHeF,tM.78K,7J.ENin4a.HkWuIWin4a.Hkt0GR6R.EN.H6Tock2fio3B8BleC,N1M0
 W5w7.6BVtC,N1M0a2.B8A0Ge.UnQbUgV7k2gcA,.GXU..Gn4ak2A,9eH.HktUo,.bVnhCUIJeJ
 hcvgV7k2KIagcU2ZesMTfPbPRPPN,dNHHuCLu0mom46631,,M0CLu8rh.CIY8JI0HWCIM0HY0m
 J0mb8JWUdQbUAdC,0GtKqtUl.akWu2UAVBAV7M0THEenSY866PM0AV1,bfA,tHBO1HM0HM0tXu
 2Y7pcU2ZX3hUYbU2as2aMBZDJecQgc3Yy2YkIc43fd2YI3d3lriKEe1B0iX3pd2R5M0tnMeHEa
 IX.tV,3aMBZD,u1.....aEyIau2Y7p6ES6C.cDAb43fd22....aEyQau2Y7p6ESMCV7KH,cDI6
 ....U7YDddC,NG.tVs.UyEQOIgaGE....M09eH22M0HWjRBd8G9WBU76F9uEF7RHtC,7S,dC2j
 UIZUoao2Yf22Ud224HNWnR0m4k2A7GLEq1,7JF0k2MGQipVI3d3FIeGEC5y4.M06F6SN76X7AV
 7AV7GHtCPM0A,QC.Q6EQ0HMWIE6S,7FHeJ,7BV7AFGIemayIW0GO0HMIZdAZv22.P.H.b1.I8Q
 6s8MHT8F,,aWUA7UU2ZeghVBjWhgUIhyghV3jU2YeA3M0gcAl4ak2A,QC..lP8r76.g,A,KIbM
 1M0QiUUaltQ5k2gcAFE8quOqhuqi0GRsMMGcPHtC,tQI501k2gcC,AV3Z7WGJ0momKq.kt.A,B
 uHZ86P96pND22T86R96P76X767uHU7kYcO,7Dv76PPMl96d8GsQd1U1Vk.kbElKLnghRBZdQbU
 .J190,78J767uEl8K,76J,A,9eHg,A,ZPNb96MA0GWKoVWmoamRQiUUa,UUohjZiUQgjpBkomK
 q.dPMHHEUU.AV3,.U7pd13ZdB3PM0K2kYOYcQiUg5dPMAZa2Zi3Yy2YkAZUY868KLr8rmCrrmK
 vKKm0Gla5bf8HN1cF.24..k2a2J1.kt.kV....MGcO.00G2.MFEtK4MAq.90PU7luG566EEG3A
 dCR6ZPNb99,NAVN8,NFR8F0GI6RZPRUYd8kYcOuHEqqkW5..kR0Gp00qqkQbUgcCl4sQd1Uk2f
 vgV7gcC76f8RBHeQ8a4rN1HcUXDJ9X1xhiZimxhgZhZJinpZHZC58RZ9P7ONbvM,Mwd0.UiQcj
 pho,YcZRiX3.5011.85...CLL.U2V.IS2U.UIU.U76.2..AU0CyIVGhighgmRiiQ88pum470,M
 wd0UnpZGhighA70,cw5.0.LJ.w.QI2U.sU.ktumdsIdPSNPN7ONbH.4D.o3aLq.,cwFE.2..F.
 pG.2U.E,,.RNEd1K5GomCb.6,6..UYU.AU.U.UUQoOF.2Uwpr,6C5H.WnlM.E.cUZj0E..UO.,
 .1.eWwV.E.0t.U...Xi0...
 --- end of encoding --- 
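
For readers unfamiliar with PEP 383, which luowy mentions above: the idea is that bytes which cannot be decoded as UTF-8 are mapped to lone surrogate code points, so that the original bytes can be restored later. The following is only a sketch of that idea using Python's built-in `surrogateescape` error handler; it is not luowy's procedure, which is contained in the encoded block above, and the variable names are made up for illustration.

```python
# Sketch of the PEP 383 idea using Python's built-in "surrogateescape" handler.
raw = b"abc\xff\xfe"  # the last two bytes are not valid UTF-8

# Each undecodable byte 0xNN is turned into the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(c)) for c in text[3:]]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

The round trip is lossless, which is why this scheme is attractive when arbitrary input sequences have to pass through a string type.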
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Ivan Denisov wrote: Zinn and Josef, please look at Table 3-7 here:
http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf
[Image: wellformedutf8.png]
Notice that the exceptions to the rule only occur if there are 3 or 4 bytes.
If we restrict our support to two bytes (16 bits) then almost all of the world's languages
are supported and the conversions become trivial.
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

DGDanforth wrote:Notice that the exceptions to the rule only occur if there are 3 or 4 bytes.
If we restrict our support to two bytes (16 bits) then almost all of the world's languages
are supported and the conversions become trivial.
No, because the number of bytes for a UTF-8 character does not match the number of bytes in UCS-2 (2-byte Unicode).
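
A small Python sketch may make the mismatch concrete. It compares the encoded length of three characters that all fit in 16 bits. Python has no UCS-2 codec, so `utf-16-le` is used here as a stand-in, which is equivalent for these characters; the helper name `byte_lengths` is made up for this example.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, 2-byte Unicode length) of one character in bytes."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# All three characters fit in 16 bits, yet their UTF-8 lengths differ.
latin_a = byte_lengths("A")        # U+0041
cyrillic = byte_lengths("\u0416")  # U+0416
euro = byte_lengths("\u20ac")      # U+20AC
```

So restricting support to 16-bit characters still requires handling 3-byte UTF-8 sequences.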
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Ivan Denisov wrote:
DGDanforth wrote:Notice that the exceptions to the rule only occur if there are 3 or 4 bytes.
If we restrict our support to two bytes (16 bits) then almost all of the world's languages
are supported and the conversions become trivial.
No, because the number of bytes for a UTF-8 character does not match the number of bytes in UCS-2 (2-byte Unicode).
So why are we using utf-8? Why aren't we using 2-byte Unicode?

I never did understand why Helmut did that.
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Josef Templ »

Once again: a VALID string in Component Pascal is any sequence of 16-bit Unicode characters, of any length
and with any values, terminated by 0X. There is NO NOTION of invalid Unicode characters in Component Pascal.
Why in the world do we need to introduce invalid Unicode characters now?
There was no problem in the past and there will be no problem in the future that is solved by
complicating the Utf8 conversion.

If it turns out to be a problem in the future we have to introduce the notion of invalid Unicode characters
e.g. in module Strings by introducing another character class. But this issue is completely
independent from our current issue of introducing 16-bit Unicode support for CP identifiers.

The Utf8 conversion in Strings is for users of Component Pascal, not for all users of UTF-8 in the world.
Those other users may use C or C++, which suffer from undetected buffer overflows etc.
Component Pascal does not have this problem.

If we want to vote for it, I see three options:
1. no checks at all as proposed by Helmut
2. format checks according to the format as defined in Wikipedia, proposed by me
3. format checks plus content checks as proposed by Ivan

My comments on the choices:
(1) is too optimistic; there may be situations where due to an error or inconsistency
a program tries to decode a Utf8 string which has not been encoded before, for example because it has been
written to a file by BlackBox 1.6. This must be detected. Not checking the Utf-8 format is like not
checking the format for string-to-integer conversion or not checking the CP syntax in the compiler.

(2) is my choice. It is simple and almost as efficient as (1). For ASCII characters there is no difference at all.

(3) is way too complicated and deviates from the definition of a string in CP.
It is asymmetric in its behavior for encoding and decoding and it even slows down conversion
of ASCII characters, at least when the algorithm proposed by Ivan is used.
This is not the BlackBox style of doing it.
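
As a rough illustration of option (2), here is a minimal Python sketch of a decoder that checks only the byte format (the lead byte pattern and the `10xxxxxx` continuation bytes) and accepts only sequences that fit in 16 bits. The function name `utf8_to_chars16` is hypothetical and this is not the code discussed in the thread. Overlong forms and surrogate values pass through unchanged, because rejecting them would be a content check.

```python
def utf8_to_chars16(data):
    """Decode UTF-8 bytes into 16-bit character codes, checking format only."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # 0xxxxxxx: ASCII, a single comparison and no further checks
            out.append(b0)
            i += 1
        elif b0 & 0xE0 == 0xC0:
            # 110xxxxx 10xxxxxx
            if i + 1 >= n or data[i + 1] & 0xC0 != 0x80:
                raise ValueError("bad continuation byte at offset %d" % i)
            out.append(((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif b0 & 0xF0 == 0xE0:
            # 1110xxxx 10xxxxxx 10xxxxxx
            if (i + 2 >= n or data[i + 1] & 0xC0 != 0x80
                    or data[i + 2] & 0xC0 != 0x80):
                raise ValueError("bad continuation byte at offset %d" % i)
            out.append(((b0 & 0x0F) << 12)
                       | ((data[i + 1] & 0x3F) << 6)
                       | (data[i + 2] & 0x3F))
            i += 3
        else:
            # stray continuation byte, or a lead byte of a sequence
            # that cannot fit in 16 bits
            raise ValueError("invalid lead byte at offset %d" % i)
    return out
```

Input that was never UTF-8 encoded is detected in most cases through the lead and continuation byte patterns, while the ASCII path stays as cheap as in an unchecked decoder.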

- Josef
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

The vote should be on a simpler question: is this procedure "for export" or "for internal use"?

1. The Kernel function Utf8ToString is "for export": we expect it to be used for unforeseen tasks that connect BlackBox with the UTF-8 world, including library bindings and arbitrary input sequences. It should be correct according to the latest Unicode standard. (Alexander's or, better, LuoWy's version.)

2. The Kernel function Utf8ToString is "for internal use" and should be renamed to AdoptStringFromSymbolFile or something like that. It should be as efficient and simple as possible. (Helmut's or, better, Josef's version.)
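
To show what "correct according to the Unicode standard" would additionally reject, here is a short Python sketch. It relies on Python's built-in strict UTF-8 decoder, which follows the well-formedness rules of the Unicode standard; the helper name `is_well_formed` is made up for this example and is not part of any BlackBox module.

```python
def is_well_formed(data):
    """Return True if data is well-formed UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False

ok = is_well_formed("\u0416\u20ac".encode("utf-8"))
overlong = is_well_formed(b"\xc0\x80")       # overlong encoding of U+0000
surrogate = is_well_formed(b"\xed\xa0\x80")  # encoded surrogate U+D800
```

Both rejected inputs have a correct byte format (lead byte followed by continuation bytes), so a decoder that checks only the format would accept them.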
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Ivan Denisov wrote: 2. The Kernel function Utf8ToString is "for internal use" and should be renamed to AdoptStringFromSymbolFile or something like that. It should be as efficient and simple as possible. (Helmut's or, better, Josef's version.)
UnicodeToString
StringToUnicode
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Josef Templ »

Ivan Denisov wrote: The vote should be on a simpler question: is this procedure "for export" or "for internal use"?
This is a much more complicated question because it shifts the focus from the technical aspect (how it is done)
to the usage aspect (how it may be used). In addition, the proposed kinds of usage don't make much sense.
A procedure exported (by means of an export marker *) is exported no matter what any vote decides.
I strongly propose staying on the technical side.
The vote should be about which kinds of checks are performed:
no checks, format checks, or content checks.

> UnicodeToString
> StringToUnicode

Doug, this is the wrong naming for sure because it hides the fact that it is about conversion from/to Utf-8 format.


A note to Ivan:

If you assign a character code to a CHAR variable in Component Pascal (ch := 0yyyX;), there is no limitation regarding the
possible values of the assigned character code. As long as there is no such limitation, there is no value in
checking the contents of a Utf-8 string. You can always introduce illegal characters into a string
by assigning character codes, by reading a two-byte Unicode value from a file, from the clipboard, etc.
Checking the contents of a CHAR or string is an independent issue that is much broader than
doing it only in the Utf8 conversion. If there is any need for such checks, it can be discussed in a separate issue.
Right now we are blocking issue-#19 by mixing it up with a different issue.
Also, there is a change in the README file committed by Ivan. This change is not related in any way to issue-#19.
Ivan, it seems that you have not understood the concept of a topic branch. The changes made on a topic branch should
all be related to that topic. That's why it is called a 'topic branch'. For somebody not experienced in
software engineering techniques this does not make a big difference; in the long term, however, it
is a must in order not to end up with a complete mess in the repository and its history.
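
The point about unrestricted character codes can be illustrated with a Python sketch, offered as an analogy only: Python strings, like Component Pascal strings as described above, can hold a lone surrogate created by plain assignment. The variable names are made up, and `surrogatepass` stands in for an encoder without content checks.

```python
# A string may hold any code, including a lone surrogate, by plain assignment.
s = "a" + chr(0xD800)

# An encoder WITH content checks refuses the string.
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# An encoder WITHOUT content checks writes the bytes anyway.
unchecked = s.encode("utf-8", errors="surrogatepass")

# A decoder with content checks then rejects what the encoder produced.
try:
    unchecked.decode("utf-8")
    decodes_back = True
except UnicodeDecodeError:
    decodes_back = False
```

Content checks in the decoder alone therefore make encoding and decoding asymmetric: a string can be written out that cannot be read back.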

- Josef
Zinn
Posts: 476
Joined: Tue Mar 25, 2014 5:56 pm
Location: Frankfurt am Main

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Zinn »

DGDanforth wrote:So why are we using utf-8? Why aren't we using 2-byte Unicode?
I never did understand why Helmut did that.
Please reread the complete threads
- Feature #9: adding module Characters
and
- Issue #19: Unicode for Component Pascal identifiers
from beginning to end, obeying the following rules:
1. Skip your own comments
2. Read Josef’s explanations twice
3. Read all other entries once
Last edited by Zinn on Tue Nov 18, 2014 8:48 pm, edited 1 time in total.
Zinn
Posts: 476
Joined: Tue Mar 25, 2014 5:56 pm
Location: Frankfurt am Main

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Zinn »

Josef Templ wrote: If we want to vote for it, I see three options:
1. no checks at all as proposed by Helmut
2. format checks according to the format as defined in Wikipedia, proposed by me
3. format checks plus content checks as proposed by Ivan

(2) is my choice. It is simple and almost as efficient as (1). For ASCII characters there is no difference at all.
I see only option (2) as the right solution. I have the same opinion as Josef.
The latest published version, CPC Edition 1.7-RC4 Build 15 of 11.11.2014,
uses Josef’s solution for the Utf8ToString conversion.
It is the best solution.