issue-#19: Unicode for Component Pascal identifiers

Merged to the master branch
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Josef Templ wrote: Ivan, please test more carefully.
Maybe you forgot to link the modified Kernel to the .exe file.
My tests show that your example correctly reports an illegal UTF-8 format.
Josef, can you please share your testing tool? I am now the mediator between Alexander and the Center. I trust him, but you are right, we need to test it more carefully.
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Post by Josef Templ »

here we go...

StdCoder.Decode ..,, ..NF....3QwdONl9RhOO9vRbf9b8R7fJHPNGomCrlAyIhgs,CbKBhZ
xi2,CoruKu4qouqm8rtuGfa4.hOO9vRb1Y66wb8RTfQ9vQRtIdvPZHWKqtCa.E.U5UaT.2U.Qk
lbeZ3DPuP7PNNvQRtId9NPuP7X2hgnRAXDJ.QCPuP7PNG2sET1.PuP.MHT9N9nt.G2sIdvPZnt
gcghghZcZRC8T0E.kP5.T.TR.2.,.Z386.QC18RdfQHfMf9R9vQ7ONb1E.kNE.0.p.0.4.I3tf
j1.0E65.ow.U.UHBm0s4Rd.8ssHomOrVyqqqqkuKmKKtCLLCJuo8.,Mw7ONh1.uGf.2UmT.6..
E1U.M361szPuH7OJNOF,7J9vQdPJdfNltCPM1HOHVuHZ8J,tIdfQHfPDvQrN1P6IZuH5OF7OJZ
OF,7FTf8rN1HcJ1eI,dQ9vQp76HeHdOFDOFZuCPM0HMORfC,NEZeI1OK,dAV76TeF,tIFuHZ8J
58G1eIrN1whpZiu2Y,J8MA,tHB8658G103OFDOGRO1HMOR96pND,d6376L76VNF78K,t8,7A18
Al86LVMRbBAVHZimBBoZJZia3bIxhHZimBBWmouKK0mrKLumGE8rmCrIin4aEY4IaGJIanQamR
qkWuIW0GWyqRqk4KIbGYIhgn,um4qE,GpmCLu.7uPPcUXDJ9X1xhiZimxhgZhZJinpZHlVGLtm
KWKqtCK.4Te..c95uPR9R.7ONbvM,kVkk.Um,..Unp3.6F6.ZD,6.636.M00U.2..AU0CyIVGh
ighgmRiiQ88pum470,Mwd0UnpZGhighA70,cw5.0.L3D.53,6.C6.QiiQ8CJuaLqKKWKqtCK.4
D.o3aLq.,cwF.,.E2Eh2.0.32.oZ,ZC.G259.G.0..676.16.6.665hKE.mLT5UTyB4.4.0E.c
UZj0E..UO2.2.A.c8U.E.0t.U...6d0...
--- end of encoding ---

- Josef

Post by Ivan Denisov »

Josef, I tried your test with blackbox-1.7-a1.024.zip, and it returns 2. All seems to be OK.

However, out := LONG(in) causes a TRAP. Maybe it is too risky to force a Latin-1 to Unicode conversion?
Attachments
testOK.png

Post by Josef Templ »

I used the formulation
out := in$;
This does a truncation if in is too large for out.

- Josef

Post by Ivan Denisov »

Josef Templ wrote: I used the formulation
out := in$;
This does a truncation if in is too large for out.

- Josef
The out in the test should have the same length as the input. In that case, there is no problem.

Code: Select all

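MODULE TestUtf8;

IMPORT Strings;

PROCEDURE Do*;
	VAR res: INTEGER;
		in: ARRAY 20 OF SHORTCHAR;
		out: ARRAY 20 OF CHAR;
BEGIN
	(* deliberately malformed input: 0EDX starts a three-byte sequence, but only one more byte (0A0X) follows *)
	in := "" + 0EDX + 0A0X + 0X;
	Strings.Utf8ToString(in, out, res);
	HALT(99);	(* deliberate trap, presumably to inspect res and out in the trap window *)
END Do;

END TestUtf8.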

Post by Ivan Denisov »

Josef, I understand now (with Alexander's help)! You made a test for truncation (res = 1), but you also supplied incorrect input. In that case the procedure should detect the incorrect input. If the input is correct but the output is not long enough, it should return 1.

Alexander suggests one more incorrect input: 0EDX + 0A1X + 8CX + 0EDX + 0BEX + 0B4X + 0X;

I have put all the tests together in one module.

Code: Select all

StdCoder.Decode ..,, ..WV....3Qw7uP5PRPPNR9Rbf9b8R79FTvMf1GomCrlAy2xhX,Cb2x
 hXhC6FU1xhiZiVBhihgmRiioedhgrZcZRiXFfaqmSrtuGfa4700zdGrr8rmCLLCJuyKtYcZRiX
 7.2.s,stD.,6.5Qw7uP51QCPuP7PNN9F9vQAy1xB.gdj,UBxhYhAbf9P0G2sIdvPZntgcghghZ
 cZRC8T0E.kOS.H.Pt,2.,U08J99SdfJHPNjvQCJuGKfaqmY6MwdONl1QCh0708T,U..w.wu4.,
 sUGpmWbBxhYhAbndMHT9NY6Mw.sQq2Y6cwB.0.1D,w,Av2E.0.oV,2.86.QC18RdfQHfMf9R9v
 Q7ONb17.,.b,,6.I16.M.6.JFyuv.U.2m,.Zw.E.c5Nf.OS28U0Cy2hgqRcjhhhBgiZgZJinpZ
 HZCh0E.4TWKKv.Uio8.,cw5.0.,,,.B.0UJUD.,.x.Umr,6.222.o.6.K,.,.x..U...B.0UJU
 C2.yzayIWKJaKIEGpmCLuKJuOKQin4qEIeGEeIL0GeKqq0LqmGE4IL0mdWqo8rw4qmOLK0mYuG
 EGomuqoCrrOLEOqr8LEaotCruKKECmMaHEemIqk4aoa0pb8Je0mdGLtaKrSqtmGEmorSqRqk40
 JdyoVKIWKJdKIEGpmMAJtCPcJ1eI,dQ9vQp76HeHdOFDOFZuCPM0HfPp761eIZOEn86Z7A,tHB
 86b8GTeIduEFOEZuCPkrKLueHE4Id.U13d,V0hc5BdChV7Ahi3YugbUIYW2Yf2Ykgc23fUQZU2
 a,3aM3YfEgin4akdGLta4umeGLnWHeyqdGLta4FNOR99,tPf9RN76ZPNbP8rN1H6HTvNRtIdfQ
 H13NGRvMTfQZPN59R,NOR9Qf9R,NAp763N8r76NuPDf93uPT9PFdQ9vQ,ND,dAHtC,,mIrin4K
 IbGIEGpmC5rN1P6I..8HJin4O3ER.U7ABp,.kd..T1..C2cE.QfkgfUIbx2Ykgc2tCPkoMAv86
 pNDAcl2fvgV7UmgfUIbx6C58KrN1H.bNL,dCvFMK2.dNL,dCcE98KrN1kOqJEe1YaMR5EPqJEe
 1WpRsIUn,.68H1.A3.b0c6QA.2CIau2YWA3U0,66.Ea..in4qEc..QaeQbBo8Uu..HMOe1.UH,
 .y4.66TeF,tEU0,Qeo,bnd.duPf8RB9CFd6NvPRvN,tQdfQ66BvPZHuKaoJYg2YdphgcQ9nIkd
 .bneEe.2ZdVj,.m2C3EFGJtKLrCqkGrmGKR0GFa0EV.224nIiHEm2m2.sArN1PMFR8FUJVigVB
 IUIhgn,YeZRCXN136JMJ.IaBIUIhAK3.bdUXDJ9X1xhiZimxhgZhZJinpZH7N58RZ9P7ONbvM,
 Mwd0.UiQcjpho,YcZRiX3.5011.85...CLL.U2V.Iy2U.UIU.U76.0E..k.8ssHpmcIf9P9fQb
 f9bWGhigFWE.4Te.sQRdIf9P9HWE.8z,U.kJl1kFF.0U10.bf9bWHZitZhZZcZRiX3Ul1.RVtZ
 BE.8z1U..2,I9,E.0.32.oZ,ZC.G259.G.0..676.16.6.665hKE.mLT5UTyB,M.M.6.,U0KyB
 .,..eF.E.k.Ue.0.,6Y1.0.UA2Tm.mmBjZ92T,eUXDFTXhhAsET1.UG2.2..606.k22.WtZCbU
 wYX8Utj00MyfUMwdc7cJ7a.Eb1...
 --- end of encoding ---

Post by Josef Templ »

There can also be the case that the input is incorrect AND the output is not long enough.
Depending on what is detected first when parsing from left to right, res will be 1 or 2.
If res is 2 and out := in$ is executed, that may also cause a truncation. This is unavoidable and
a very exceptional error case.

What is wrong with 0EDX + 0A1X + 8CX + 0EDX + 0BEX + 0B4X + 0X?

- Josef

Post by Ivan Denisov »

http://www.ietf.org/rfc/rfc3629.txt wrote: 4. Syntax of UTF-8 Byte Sequences

For the convenience of implementors using ABNF, a definition of UTF-8
in ABNF syntax is given here.

A UTF-8 string is a sequence of octets representing a sequence of UCS
characters. An octet sequence is valid UTF-8 only if it matches the
following syntax, which is derived from the rules for encoding UTF-8
and is expressed in the ABNF of [RFC2234].

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF

NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This
grammar is believed to describe the same thing Unicode describes, but
does not claim to be authoritative. Implementors are urged to rely
on the authoritative source, rather than on this ABNF.
In Oberon notation this looks like:

Code: Select all

   UTF8-octets = *( UTF8-char )
   UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1      = 00X-7FX
   UTF8-2      = 0C2X-0DFX UTF8-tail
   UTF8-3      = 0E0X 0A0X-0BFX UTF8-tail / 0E1X-0ECX 2( UTF8-tail ) / 0EDX 80X-9FX UTF8-tail / 0EEX-0EFX 2( UTF8-tail )
   UTF8-4      = 0F0X 90X-0BFX 2( UTF8-tail ) / 0F1X-0F3X 3( UTF8-tail ) / 0F4X 80X-8FX 2( UTF8-tail )
   UTF8-tail   = 80X-0BFX
The string 0EDX + 0A1X + 8CX + 0EDX + 0BEX + 0B4X + 0X starts like the UTF8-3 pattern 0EDX 80X-9FX UTF8-tail, but the second byte 0A1X does not belong to the range 80X-9FX: 0A1X (= 161) > 9FX (= 159). That is why this sequence is incorrect!
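The second-byte rule for three-byte sequences can be sketched in Component Pascal as follows. This is only an illustration: ValidSecond and the module name are hypothetical and not part of BlackBox or its Strings module, and the sketch covers only the UTF8-3 rows of the grammar, not the full decoder.

Code: Select all

MODULE TestUtf8Ranges;

	(* Hypothetical helper, not part of BlackBox: checks the second byte of a
		three-byte UTF-8 sequence against the UTF8-3 ranges of RFC 3629, section 4. *)
	PROCEDURE ValidSecond* (lead, second: SHORTCHAR): BOOLEAN;
	BEGIN
		IF lead = 0E0X THEN	(* 0E0X must be followed by 0A0X..0BFX *)
			RETURN (second >= 0A0X) & (second <= 0BFX)
		ELSIF lead = 0EDX THEN	(* 0EDX must be followed by 80X..9FX *)
			RETURN (second >= 80X) & (second <= 9FX)
		ELSIF (lead >= 0E1X) & (lead <= 0EFX) THEN	(* remaining leads take any tail byte *)
			RETURN (second >= 80X) & (second <= 0BFX)
		ELSE	(* not the lead byte of a three-byte sequence *)
			RETURN FALSE
		END
	END ValidSecond;

END TestUtf8Ranges.
With these ranges, ValidSecond(0EDX, 0A1X) yields FALSE, while ValidSecond(0EDX, 9FX) yields TRUE.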

Post by Ivan Denisov »

Josef Templ wrote: If res is 2 and out := in$ is executed, that may also cause a truncation. It is unavoidable and a very exceptional error case.
This is not a normal truncation, because it is not a truncation during the UTF-8 conversion.
I think we should remove this code (out := LONG(in) or out := in$). The programmer should do honest error handling: if Utf8ToString returns 2, he MUST think about what is wrong with the string. The useful part of Latin-1 (0-127, ASCII) cannot produce an error, because it fits the pattern UTF8-1 = 00X-7FX! Unfortunately, old BlackBox supports some symbols from the upper half of Latin-1 :)

Post by Josef Templ »

I cannot read this weird grammar notation.
According to Wikipedia, it is a correct UTF-8 sequence,
because the tail has the right number of bytes and all tail bytes have the high bits 10.

- Josef
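
For reference, the purely structural reading and the RFC 3629 reading can be compared with a small sketch. The helpers Decode3 and IsSurrogate below are hypothetical (they are not BlackBox code): the first decodes a three-byte sequence by bit pattern alone, the second tests whether the result falls in the surrogate range 0D800H..0DFFFH, which RFC 3629 excludes from UTF-8.

Code: Select all

MODULE TestUtf8Surrogate;

	(* Hypothetical helper: decodes one three-byte sequence by bit pattern only,
		without any range checks on the bytes. *)
	PROCEDURE Decode3* (b0, b1, b2: SHORTCHAR): INTEGER;
	BEGIN
		(* 4 payload bits from the lead byte, 6 from each tail byte *)
		RETURN (ORD(b0) MOD 16) * 4096 + (ORD(b1) MOD 64) * 64 + ORD(b2) MOD 64
	END Decode3;

	(* Hypothetical helper: TRUE if code lies in the surrogate range. *)
	PROCEDURE IsSurrogate* (code: INTEGER): BOOLEAN;
	BEGIN
		RETURN (code >= 0D800H) & (code <= 0DFFFH)
	END IsSurrogate;

END TestUtf8Surrogate.
Decode3(0EDX, 0A1X, 8CX) gives 0D84CH and Decode3(0EDX, 0BEX, 0B4X) gives 0DFB4H. Both values are surrogates, so the sequence passes a check of tail count and high bits but is rejected by the RFC 3629 grammar.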