issue-#19: Unicode for Component Pascal identifiers

Merged to the master branch
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Naming
Strings.CharsToSChars(IN x: ARRAY OF CHAR; OUT y: ARRAY OF SHORTCHAR);
Strings.SCharsToChars(IN x: ARRAY OF SHORTCHAR; OUT y: ARRAY OF CHAR; OUT res: INTEGER);

Only in the documentation of these functions is there reference to Utf8.
CharsToSChars always succeeds.
SCharsToChars needs a validity checker to notify the user via res that the short character sequence is valid.
It may not be if generated by some process other than Strings.

By using "Chars" one knows that the standard 16-bit CHAR is being used.

-Doug
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

I made some measurements for demonstration.

First, during startup BlackBox calls Kernel.Utf8ToString 7046 times.

Second, I wrote a test to compare the conversion speed for a difficult string and a simple ASCII string. The numbers show how long it takes to convert each of two model strings 100000 times (1st string: "test string ïöíúüáä это строка для проверки هیروگلیفsd", 2nd string: "simple ASCII string").
You can see that the difference between the Josef Templ and LuoWy versions is 30% for simple ASCII and 50% for the difficult string.

Third, as you can see, Josef's method fails to detect malformed sequences according to the Unicode 7 standard.
Josef Templ version:
68.7 ms
12.3 ms
Incorrect input 1: $TRUE
Incorrect input 2: $FALSE
Truncated: $TRUE

Alexander Shiryaev version:
80.1 ms
15.9 ms
Incorrect input 1: $TRUE
Incorrect input 2: $TRUE
Truncated: $TRUE

LuoWy version:
89.2 ms
15.4 ms
Incorrect input 1: $TRUE
Incorrect input 2: $TRUE
Truncated: $TRUE
This is the test I was using.


StdCoder.Decode ..,, ..k40...3Qw7uP5PRPPNR9Rbf9b8R79FTvMf1GomCrlAy2xhX,Cb2x
 hXhC6FU1xhiZiVBhihgmRiioedhgrZcZRiXFfaqmSrtuGfa4700zdGrr8rmCLLCJuyKtYcZRiX
 7.2.s,MB1,0k,5TWyql.bnayKmKKqGomC5XzET1.PuP.MHT9N9ntumaU2,CJuyKtQC98P9PP7O
 NbXmb.2.QkUk2kx00,6.cUGpmWLuOpoKqvCbHZiYpedhA704TeKKw.bHfEWUmL.6..D.9762U.
 sUXDJ99SqorGqmQCbWBxhYFWUl1UnNHEWUmr.6.wZSk5kHu,,E.M.Vd.cU.ktAcoZimBhWhioh
 gnZcZRCY.2.wY.E.0.p.,6.M.,.JFyuv.U.2m,.7z.AU0KyB.,U2X.UO.,.1.e0.,6j3.y.wU7
 k.y.Ac,k.82QHAUJ2.A1AU,E.0kO8.1M.kI9.O,2,A.E71U.AU2U.QX,k.G.0U26.ezzzzzB.6
 xzzzz5cy.6wzzzz5.we,E.2.uFq8Ua5V0cUXDF9fR5uPPPP1fP7PNZvQRtIdHf.2UlbcZpC.c9
 h0E.8z,E.0.,V.2.o.6.K,yU.E.ED.cwR.0.,,,.B.0UJ.E.ED..6.222.o.6.K,0,,6.ED..6
 .222.o.6.K,0,,6.ED..6..E1U.M3c1szPuH7OJNOF,7J9vQdPJdfNltCPM1H68J76Je9,7J9P
 PV9PN761e9,tIFPOZPS1PNh99,NGR767ONRPObvPh99,tJR76NORT96BvPZ96HuQbPR9965NAn
 76JN8PM1HMGP8ITeId86b8RZPORvNb99,7HTvNN76LONZfP99PrN1HM1HcJ1eIPM0H6Qp76VeI
 TuE98FfeI988HeH,NORfC,NEZeI1OK,tHB86b8GTeIduEFOEZuC,tHf8J,tPf9Rp761eI.CIY4
 2UmhgnJbUAdCZe3xc3JedQbBgV72eGxd1,0GeK4Xd8rN1HcJ1GE8rmC5.in4ak2aKreHE.8HM0
 mbOIEC3.q.TPRdfC..C2EVKoXaIbqk2ako2YugbUIYW2Yf2Ykgc23fUQZU2a,3aM3YfEgin4ak
 20LIaKrmGEyquYZUIiZN8rN1HM0NuPDf9b8RZPO2ZWAdiRgjJimhgXZiUAhi3ipZiUAau2YWAZ
 v2YAxhbpZ0xhjZhcIiZRiUgbUIadQbU.NePrN1HMFR8F,7J9vQQbBg,..IaeQbBAVK,I5.k2aa
 u..sI..T1..C2EV.kh0ni0GRqHE0nWGYvg,a4XNL,dCvV,BaMRbBA,cAv86pNDWnVWpRqk2Ung
 fUIbx6A90.dNL,dCcE98KrN1UpgfUI56BluC.hNL,dC6Kr,V1.kIUAVH,UX,UkVmIbUIYd.I6.
 ,,.m2g6.in4qk2..sAJtCPM0h0cC.U7ABp,.kd..y4.66TeF,tE.30kdGLta4bf9b0Yejheopg
 s2ZWYhjphb3YnZimVUogjFuKaoJYg2YdphgcQ9nIUk,y4.Eakd.FFe8ruuql4KuKKmeHE8mI.8
 2UUkMamR0GaEaU3,sArN1PM0.keGLnYejReoJidlUCJJ0GIaIb00.w7C3.00wBe1.Q6.ICe1.H
 l4aEfEmmGE8KK2jg2YdZZUIhg2YhBgsJbUAdC,i130Em0GRqHEQbUAhUIbx.J96pNDkq4Kw0GR
 qXAhcC3ZjhioBZUgZUAavgV7AVL3d7Zd33YcAhUYbUYd3p7HfPHN8,d7,78Hnhaqi0mF0GMWpI
 0GH0WY3YygbU2ad2Y2xdU2jUIbxsHZ8FFNOR1HtCPM0akV4odKIEGKEyIX2augV7AV7AV1,l96
 TeFoZiwa43dugV7k2Ad43Ye3Yw2YhBAGJYKIb0mrKLuiJpqJEe158GZ88lP8r76HeH588JP8PM
 0HM0HkWmodKIEGaug5PdAPM0Hk2gcCN1HM0HM0t96VtEZ7GRd9V7FB8Gp76396pNDWLEqobGIE
 8HMWoR0Gm0GRMAPM0Hk2m598AFe9R7A9eFFeC,,2DkM0HYiHE.ZN1HM0akWm2.PNAPM0Hk2KIb
 .tnMen4ak2AV1,w7WHMWILuW0pc6JbBU7MG....C28KEeGEGHMWIEiGEWLEq2GHMWoIiHE.iHE
 GKEEMqk2aU7MFN0UhI4M0KIb.HMFN0.X,akWu26T8HRqk2aU7Q6..aHXWIRq.aU7Fl0mS0GM0G
 eW2GKEe1.HkWm28KEe1396J,00.sC,,A4M0KIb.H6T0nU0HYuWk.k2A7.78G,7J.M9U7MFN0I5
 8ae.,,.Q5Ul.HkWu2M098H6NkK.HMFR0UvgV7kYuoVWmoam4ak2K2krsK.V7KkYOYY3Yx2Yk22
 EtKaUIbx22,78J76TvO,d8HN16HaIX0GmkK8HEGJYK2.4HEWGJ0Gu8ru.0GJam4.9GtKqtUmQb
 U2Ze2YYhgXxhYhgUAhiRAP9QNPNdPN,tPZnm8LtyKt0GJam4Ebg,9eH0meGLn..rN1HM103...
 eIeeGEWmYu20GR0mU.sI.Q5kr66p7610U1,qEE0GE0GE..kbKJeICeX7,A30GEOpU8JEaKK00O
 rkmKK0mq44p76H0sC,tMF96p76b8G.gVU2YB2YU2eGx7.BuPZPP19R9eQZvPWmIin40W0hc5B7
 ,tPkSAhiZYvcQ9vQ,,Z76Fd8BvPZ1,NNZfQIZdgVU2Y3pd23Y4xhmV3,rN1gV7Ic3VB2YU2YX3
 hUMDaaP3aRRbUAhUUlQbUIhUUkQ5Ux6HsPA3k40WUwe6BdAVX3hUQYU24Ue3YwMPAZUY6P66,F
 E0mYOIECKo0GS0GQ0Hg0GeW2P6666whptK.59Or76HeHIBP660WUgcARe7llUkg6l86d0UUEv4
 aUIbxsHsMFP8,N9AbmQbB2222a2hPMNHS0WUY82Y4xhm,UUIe3ZeJJeC3Y3pd2tCP66,VUklcC
 ABaav2Y7VdtC,dR19Pe1hPMNHJ0GPGHEiGEyIdG2.Z7CrN1,,,V7FICKo0GS25aWDJeU2ZXFTq
 198AaGEGJYK2.....,Vj,U1VqBggBZv22cO.,NF.Y5VdFV7K,7Jk40GE0WUEv66pVD7FM8qWmI
 aoQbB2222a2h1tVU.O2....66QAe1H1.a4UuEvEJ..Ul.EEMG230GS25sH00o5.G3.....22C4
 pVd,..I5O5e0.kIg3.66A7EEm1kb.u1.d0.....00T1.50M8.J1669W3hVU2200BuP..UB2YUE
 EKIb.665Xuko.UdN1,76K2rFEsP.6Al0A7C4v76V7K,,IC.,VjRheAZUgcAR8,,kM00dfQffPU
 eAZUgcCZcB.K3.UvgV7g,.ke..NuJJ76F,f9RB9Cp7610cFC3UG76T0b9RZfC,NE.Q6.ICe1.H
 V7sETeHb8J,NON9P9vN19P3OSdPN,ND,dArFu8ru.0mS0mMiHIeGE8rmgigZiUI3aU1,Oqo8rt
 GLEqXkQ5dPMH9P,ND,NAktGrkGrmemIqk4AVKB6WLK00kqcC.QbBAV7AVX3hu2YH,UBUnZZUQi
 mI5HeH.,78QC66JN8M0dfC,NG.UoBgdFlaLuKqt0GJa0whfZZUQipJimJbUIcDxdAhc,pdvAVU
 2Ze2YnhimtPDPMdPN,,M1HM0h0Fd8VOFVuAltAJN8PM0aElKLneHE42sA,tHB86b0.ErmGEqKR
 0mYu2k4k4Ic3,HEtKqt000nRqk2ako0GRq10Gp.gB00m2CLu8rI0mK0Wv2Yn3Yug5BPOZXv2YB
 AV7w8UpZiatKHPL,t601O0.a00m4aU7QA,dCgiopAsCPM0akYOYn3Yx2Ya,,7JF0PM0Hk284r8
 Av86sMcP,dCvlMi1Z76pVkktKLt2Yug5BOENuI9uC,76PM0Hk2CoUklWKEyIXgV7k2mbl2fcIZ
 k2feAZioZrocMJbUQioJiPJhR32QArlYEpQbBU7YDVtEWJLuGMGYMJbU2jUI5yY2VdM9A50mt0
 0Grkaav2Yo3YukMgV7k2m59WioZkg6leC,N1HU76S,dCw7.I42YBU7MGQAqXkg600sQZ76MAU7
 MFNmY.EW2YI,.ZtCPM0A,9eHiX7k2QiUI5G5.I4k2m5BWio3B8BleC,N1M0W5w7.6BVtC,N1M0
 a2.B8A00.sAr76PM0A,98H.Uo66.UogV7A,HkWu2k2QC6R.kNsQf16JZOJ9uCPM0AV3Z7986Fd
 8CqruqtKrqKKEOqoYinZiU6PUUIA22U7sQdfQr0sE6A7uEV7AF86L,UdQ5H066ZPN00aKqm4I6
 QbBA,akWu2UAVBA,akr2Yug5d00m4a.HsEkt8XDpcBAV7AV7YDeHEaIX0GI66tFMWHMWZDJecQ
 gc3Yy2YkIc43fd2YI,TvOe1B0iX3pd2R5M0tnMeHEaIX.tV,7KHtHUy.....U7YDZFEaIX.tVs
 .UykQOIgaGE....M0tnNeHEaIX.tVt2aMB3UyEV.....aEyYau2Y7,Y5W1.u1ldFlO8,,...U7
 gcCFEU7A7yqpYe6h6q.aU2hc13ZoBZv2Ys3YuEwIZU..aGEMAZVUg,A,HWo3YxEEG3.HU7ltK5
 GJYEIeGEC5y4.M06F6SN76X7AV7AV7GHV7k2kt.kV.l7AV7G,Vs7FHeJ,7BV7AFGIemayIW00V
 FJamIiHE.g,A,QC.EdkVUfUB,M80mY.0GIemq4qw8qm0Gpunq4Kw0GEemIU7kWm2PM0Hk2kt..
 2D.k4k2MFRWBU7kt00O4bnR.HkWm2,dMffNrePv86pVXV7ViBZv2YnFR6A.HkWu2k2KIa68J76
 H9PN100QC.HEXyId0mq0GREEwdUohUgZUAaUYcD,a.HWe7Dv76PPMl96d0CLu.sE6A.w7IgppA
 PPLHN8r76EpkWEEWGJ0GWCIgWJE0GJ.HkWu2P.HEt.aKq.30r767OF588HP8rlt00O4.0GryKu
 0mlyKr.H9PUUYiVBB,,00k2K2..aIbMOg,K2kYOYcQiUg5dPMAZa2Zi3Yy2YkAZUY868KLr8rm
 CrrmKvKKm0Gla5bf8HN1cF.24..k2a2J1.kt.kV....MGcO.00G2.MFEt.a4EVkRq.90PU7luG
 566EE.Z16RZPRUYNFRWU23GLt.c8kYcOuHEqqkW5...QbUIBMP1nR0mWu2PUnZCe46AluCakW6
 6EeQ8a4QbBA,0Jd.sEFPN5vOb8Q9PN796F7R9vQCJu85eHE4IdUDlVkIi1h03PM5vOp7610Aam
 66T0sC,NRdfNeX,,8nOOHEyIX0md.k4cQYZUAhgcPp76HeHK2UolU.X7A,tHBGayIbSoYuIein
 4A,VdC,7HT0U0,sG9fQRXiQeoJidFe6R2ZohA.mGEKLuOag2YmlIk2qU7o6ZGr0GREEEaKIbYi
 d2Yh.qk2ak2GbUIbxsG9fQEeaqqKKIamRq.O2aKRqHE01Aak240HEGobq.aEsgio,3PM5HK0Gt
 K4PM0akWsCA,dvKRPL,,b8GTWc2Z9hgm,.,N9GLMaGEeGE4HM0XUMGQdZJC6RHPP9fI9vQTnuG
 royKram4aU3,Yik.VtCPU4Vi,0Ge6H.A470a.Eu0HEiGEGrhuqiqk2a.UAVGhgVZh4xhm78d9A
 T7H9eHFVg2Yloag2YkYZUgZlYZU2b439r76N0b023,NPbf6HtC,7H6Hqk2KIb2Y13hZ,sCPM1P
 M0VeI.Y8YaeQbBAVKVnZCX79,tQdHNeHE42EN.50M12eG,EWyKeKqtGLIamRq.3OF.HsEFPN51
 68b9RZPAHtCPU7..ENamRq.GpmCLu41.81a.sAAV7g6YcjZ8QbBk4I6.N0HePd98LONUpZCGrr
 CJuUChihBZv2YAVAtCU7QC4HEenSIYohgn76bXdFEySvrSxnzkHSEqI09I0vH01mUGMVGMUGMT
 GMRGMMGMEGH0jH0zI01mTGMUGMTIa2ia24c2Kb24b24Y5pkArklok6pkjqk2pkArk,pknZgaJY
 vgV7sQIaUI5QidhhkZhZ3Y,Re1Bd73YnZimVWQbBA,PUA,2YAxBC368eorCrmOKEGpm2CoiZJi
 nBhjphuIYdQbU.NWBEs0GRqHEKJu6JC3HX8V76FT9J9XvUB.UU.b02318P99S1fP7PNZ96b8OH
 1OLEOrm85.UAl46Q.kd.10Y6d0k4.EEUH,WWAhijxeZ3YqhA..m2PUk,.b0Y7A,7Ge.g,9eHYe
 ZRCdtCPM1KIbG2MJc9PM136J9vQ.dONbnMqE,G3K3.ZN136J9XJ,kNqk48Eeke.6BPM136JMJ.
 ga0CyIhACoruKu8rrmKqKKtCLLCZYRcoJigZcZRiX3Ulb8..umVyKrG5EWQiX3.501POELUm,.
 .Unp3.6F6.ZD,2U.UIU.U76.0E..k.8ssH38pumqm8rtumdcIf9PY62Ulb8.CLL8pumqmY62Um
 T.2U.kJ3.D6.0kFF.0U10.bf9bWHZitZhZZcZtM,Mw.ELMSN12Umz.,6.0.E2Eh2U.2U.E,,.R
 NEd1K5GomCrl0U2U...G00k.0.0.0mFf3,E.mLT5UTyB,M.,U,U.2.8Mtr.2..c4,.,.1.e06.
 2UEC.6..mEw3UAUgQnPt0lLU8ssHorMP9fPsET1.UG2U.E..U6U..HE.6aLuQ0mHCe.az86Utj
 0WlbWaUKZM4.Co0...
 --- end of encoding ---
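
The "Incorrect input" and "Truncated" lines in the measurements above report whether each version flags malformed or truncated input. The actual test inputs are inside the encoded document and are not reproduced here. As a rough illustration only, in Python rather than Component Pascal, the sketch below shows the kinds of byte sequences a strict well-formedness check rejects. The helper name `is_well_formed_utf8` and the sample byte strings are assumptions chosen for this example; they are not taken from any of the three versions.

```python
# Illustration only: not the BlackBox code and not the inputs used in the test above.
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # Python's UTF-8 decoder is strict by default
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample inputs, one per class of error.
samples = {
    "valid two-byte sequence (U+00E4)": b"\xc3\xa4",
    "overlong encoding of U+0000": b"\xc0\x80",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "truncated three-byte sequence": b"\xe2\x82",
    "stray continuation byte": b"\x80",
}

for label, raw in samples.items():
    print(label, "->", is_well_formed_utf8(raw))
```

Only the first sample passes. A converter that checks just the lead-byte and continuation-byte pattern would typically accept the surrogate case as well, which is the kind of gap the "format check" discussion in this thread is about.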
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Josef Templ »

DGDanforth wrote:Naming
Strings.CharsToSChars(IN x: ARRAY OF CHAR; OUT y: ARRAY OF SHORTCHAR);
Strings.SCharsToChars(IN x: ARRAY OF SHORTCHAR; OUT y: ARRAY OF CHAR; OUT res: INTEGER);

-Doug
Sorry, Doug, but this is also wrong.
A short string and a Utf8 encoded string are NOT the same.
A short string is a string that consists of SHORTCHARs only. No encoding, plain ISO-Latin1, 1 byte per character.
A Utf8 string is a sequence of bytes that encodes a (long) string and in order not to use SYSTEM.BYTE
as the element type it uses SHORTCHAR as an alternative that avoids importing SYSTEM.
The true nature of a Utf-8 string is actually ARRAY OF SYSTEM.BYTE.
For strings consisting of plain ASCII characters only, short strings and Utf8
strings have the same representation.

- Josef
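
The distinction between a short (ISO-Latin1) string and a Utf8 string can be checked with a few lines of code. The sketch below uses Python purely as an illustration, since its `latin-1` and `utf-8` codecs correspond to the two representations being discussed. The sample strings are assumptions based on the test strings quoted earlier in the thread.

```python
# Illustration only: Python codecs stand in for the two byte representations.
ascii_text = "simple ASCII string"
# The accented letters from the earlier test string, written as escapes
# so the source file's own encoding does not matter.
accented_text = "\u00ef\u00f6\u00ed\u00fa\u00fc\u00e1\u00e4"

# Plain ASCII: both representations are byte-for-byte identical.
same_for_ascii = ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# Beyond ASCII: Latin-1 uses 1 byte per character, UTF-8 uses 2 for this range.
latin1_bytes = accented_text.encode("latin-1")
utf8_bytes = accented_text.encode("utf-8")

# Reading UTF-8 bytes as if they were a short string gives different text.
misread = utf8_bytes.decode("latin-1")

print(same_for_ascii)                      # True
print(len(latin1_bytes), len(utf8_bytes))  # 7 14
print(misread == accented_text)            # False
```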
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Josef Templ wrote:The voting should be about which kinds of checks are performed.
no checks, format checks, content checks.
Josef, your version does not perform a format check in the sense you describe. A format check means detecting whether the sequence is well-formed according to the Unicode 7 standard, and your version does not do that. So what you are talking about is a "wiki format check".

- no checks
- wiki format checks
- format checks
Josef Templ wrote:If you assign a character code to a CHAR variable in ComponentPascal (ch := 0yyyX;), there is no limitation regarding the
possible values of the assigned character code. As long as there is no such limitation, there is no value in
checking the contents of an Utf-8 string. You can always introduce illegal characters into a string
by means of an assignment of character codes, by means of reading in a two byte Unicode from a file, from the clipboard, etc.
Checking the contents of a CHAR or string is an independent issue that is much broader than
doing it only in the Utf8 conversion. If there is any need for doing such checks, it can be discussed in a separate issue.
Now we are blocking issue-#19 with mixing it up with a different issue.
That is OK if we rename the function and do not use it in the Strings module.
Josef Templ wrote:Also there is a change in the README file committed by Ivan. This change is not related in any way with issue-#19.
Ivan, it seems that you have not understood the concept of a topic branch. The changes done for a topic branch should
all be related with that topic. That's why it is called a 'topic branch'. For somebody not experienced in
software engineering techniques this does not make a big difference, however, in the long term it
is a must in order not to get a complete mess in the repository and its history.
I do not think we should vote on changes that are aimed at supporting repository management. Josef and I are currently the two people responsible for this, and we can make any changes that do not touch BlackBox without voting. This is done so that people can easily work with our repository. So I fixed the build pipeline (you liked this, although it also has nothing to do with #19) and fixed the README.
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

I want to draw everybody's attention to the fact that two people from outside the Center have sent us their solutions. So how we solve this issue matters to people. It is important to them that we use correct code.
Zinn
Posts: 476
Joined: Tue Mar 25, 2014 5:56 pm
Location: Frankfurt am Main

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Zinn »

Josef's solution is also correct code. It does its task very well. There exists more than one solution.
The solution from Alexander and the solution from LuoWy can be put into a library
and used with other projects where you need this kind of error behavior.
Feel free to use it in your own programs.

There are other places inside BlackBox that use an “inline procedure” for translating Utf8 to String
without any error checking (e.g. string constants).
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Helmut and Josef, can we put LuoWy's solution into Strings but keep Josef's faster version in Kernel?

I also like how LuoWy returns the error. It allows detecting both truncation and wrong-character errors simultaneously.

1 in case of error, 10 in case of truncation and 11 in case of error and truncation. So:
(res MOD 10 = 1) gives wrong character error
(res DIV 10 = 1) gives truncation error
You can see this in the original LuoWy version.
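
The decimal encoding of the result code described above can be decoded as in the following sketch. It is written in Python as an illustration only, with `%` and `//` standing in for MOD and DIV. The function name `decode_res` is hypothetical, and treating 0 as "no error" is an assumption, since only the values 1, 10 and 11 are listed above.

```python
# Illustration only: decode_res is a hypothetical helper, not part of Strings.
def decode_res(res: int) -> tuple:
    """Split a combined result code into (wrong_character, truncated) flags."""
    wrong_character = res % 10 == 1  # corresponds to res MOD 10 = 1
    truncated = res // 10 == 1       # corresponds to res DIV 10 = 1
    return wrong_character, truncated


# 0 is assumed to mean success; 1, 10 and 11 are the values named in the post.
for res in (0, 1, 10, 11):
    print(res, decode_res(res))
```

The ones digit carries the wrong-character flag and the tens digit carries the truncation flag, so both conditions can be reported through a single INTEGER result.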
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Josef Templ wrote: For strings consisting of plain ASCII characters only, short strings and Utf8
strings have the same representation.

- Josef
That is all that I was claiming. Any other input by CHAR would not necessarily generate ISO bytes. But that is fine. The SHORTCHARs are just an encoding of the CHAR array. I simply wanted to avoid the use of Utf8 in the name because the 16 bit Unicode does not include all of the possible Utf8 encodings. Hence to call it Utf8 is misleading. It is a subset of Utf8. Call it partial Utf8 or something like that. This is a different issue from that of valid Utf8 sequences.
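
The "subset of Utf8" point can be made concrete by looking at how many bytes UTF-8 needs for a given code point. The sketch below is a Python illustration only. The helper name `utf8_length` is hypothetical, and the sketch assumes, as the post does, that a CHAR holds a single 16-bit value with no surrogate pairing.

```python
# Illustration only: utf8_length is a hypothetical helper.
MAX_16_BIT_CHAR = 0xFFFF  # largest value a single 16-bit CHAR can hold (assumption)


def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


for code_point in (0x7F, 0x7FF, 0xFFFF, 0x10000):
    fits = code_point <= MAX_16_BIT_CHAR
    print(hex(code_point), utf8_length(code_point), "fits in 16 bits:", fits)
```

Under that assumption, every 16-bit value needs at most three bytes, so four-byte UTF-8 sequences never arise when converting from an ARRAY OF CHAR and cannot be stored in a single CHAR when converting back.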
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Is the use of Utf8 simply to avoid changing SHORTCHAR to CHAR within all of the compiler modules?
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

I just searched my version of BB1.6 and was shocked to find that the compiler modules of that version do use Utf8!

I was under the impression that Helmut was the first to use Utf8 with BlackBox.

Has my version of BB1.6 been corrupted or does it really use Utf8?