issue-#19: Unicode for Component Pascal identifiers

Merged to the master branch

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

I just read Helmut's comment in the other posting and see that he does convert 16-bit input to Utf8 output.
So everything is now clear and consistent.
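
For readers following along, here is a minimal sketch of what converting 16-bit input to Utf8 output means for a single character. The module and procedure names (DemoUtf8Encode, PutChar) are made up for illustration and are not Helmut's actual code. The sketch assumes that out has room for up to three more bytes, and it does not treat surrogate code units specially.

```component-pascal
MODULE DemoUtf8Encode;
  (* Hypothetical demo module, not part of BlackBox. *)

  (* Appends the UTF-8 encoding of one 16-bit CHAR to out, starting at index j.
     Assumption: out has room for up to three more bytes. *)
  PROCEDURE PutChar* (ch: CHAR; VAR out: ARRAY OF SHORTCHAR; VAR j: INTEGER);
    VAR val: INTEGER;
  BEGIN
    val := ORD(ch);
    IF val < 128 THEN  (* 1 byte: 0xxxxxxx *)
      out[j] := SHORT(ch); INC(j)
    ELSIF val < 2048 THEN  (* 2 bytes: 110xxxxx 10xxxxxx *)
      out[j] := SHORT(CHR(val DIV 64 + 192)); INC(j);
      out[j] := SHORT(CHR(val MOD 64 + 128)); INC(j)
    ELSE  (* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
      out[j] := SHORT(CHR(val DIV 4096 + 224)); INC(j);
      out[j] := SHORT(CHR(val DIV 64 MOD 64 + 128)); INC(j);
      out[j] := SHORT(CHR(val MOD 64 + 128)); INC(j)
    END
  END PutChar;

END DemoUtf8Encode.
```

For example, the character 0E9X (decimal 233) takes the two-byte branch and yields 0C3X 0A9X, and 20ACX (decimal 8364) takes the three-byte branch and yields 0E2X 82X 0ACX.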

Sorry for the prolonged discussions
(I hope they helped others in their understanding of what is being done to provide Unicode support for identifiers).

-Doug

Post by DGDanforth »

To understand the extent of the changes necessary to include Helmut's Utf8 internal encoding, I searched his most recent release, BB1.7, and found 25 files that mention Utf8 (the number in square brackets [] specifies the number of times Utf8 occurs in the file). Here is the list:

Location                        Matches
Ctl/Mod/Office.odc              utf8 [1]
Dev/Mod/Analyzer.odc            utf8 [3]
Dev/Mod/Browser.odc             utf8 [13]
Dev/Mod/ComDebug.odc            utf8 [3]
Dev/Mod/Commanders.odc          utf8 [2]
Dev/Mod/CPB.odc                 utf8 [4]
Dev/Mod/CPE.odc                 utf8 [1]
Dev/Mod/CPM.odc                 utf8 [9]
Dev/Mod/CPP.odc                 utf8 [3]
Dev/Mod/CPS.odc                 utf8 [2]
Dev/Mod/CPT.odc                 utf8 [6]
Dev/Mod/Debug.odc               utf8 [25]
Dev/Mod/Dependencies.odc        utf8 [2]
Dev/Mod/HeapSpy.odc             utf8 [2]
Dev/Mod/Linker.odc              utf8 [8]
Dev/Mod/MsgSpy.odc              utf8 [2]
Dev/Mod/Packer.odc              utf8 [2]
Dev/Mod/Profiler.odc            utf8 [3]
Std/Mod/Debug.odc               utf8 [13]
Std/Mod/Loader.odc              utf8 [7]
System/Mod/Kernel.odc           utf8 [9]
System/Mod/Meta.odc             utf8 [8]
System/Mod/Services.odc         utf8 [2]
System/Mod/Strings.odc          utf8 [6]
Xhtml/Mod/StdFileWriters.odc    utf-8 [2]

It appears that DevCPM duplicates the code of Kernel (for Utf8 conversion). Is that necessary?
Also, not all references to Utf8 are necessarily due to the desire to have identifiers in Unicode.
For example, in StdFileWriters one reference is:

Code:

		String(wr, '<?xml version="1.0" encoding="UTF-8"?>'); wr.Ln;
I hope this information is useful to others.
-Doug

Post by Ivan Denisov »

Alexander Shiryaev improved the UTF-8 decoder:
http://forum.oberoncore.ru/viewtopic.ph ... 92b#p89557
This follows the Security Considerations of RFC 3629:
http://www.ietf.org/rfc/rfc3629.txt (10. Security Considerations)

Differences:
http://redmine.blackboxframework.org/pr ... 1&type=sbs

This version:
blackbox-1.7-a1.022.zip

I ran a small general test with different languages:
[Attachment: test2.png]

Post by Zinn »

Ivan, I don’t understand why you changed the Kernel.Utf8ToString implementation.
What is wrong with the original version?

Please send me your test program as a Std.Coded file.
What is the difference in behavior when using this program with the original and the changed implementation?

Post by Ivan Denisov »

Zinn wrote: Ivan, I don’t understand why you changed the Kernel.Utf8ToString implementation.
What is wrong with the original version?

Please send me your test program as a Std.Coded file.
What is the difference in behavior when using this program with the original and the changed implementation?

The original version is not safe. See http://www.ietf.org/rfc/rfc3629.txt (10. Security Considerations):
Implementers of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences. It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.
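
To make the quoted concern concrete, here is a minimal sketch of one kind of illegal sequence, the overlong two-byte form. The module and procedure names (DemoOverlong, Naive2, Valid2) are hypothetical and are not taken from Kernel or from Alexander's decoder. The valid range for the lead byte follows the UTF8-2 rule in section 4 of RFC 3629.

```component-pascal
MODULE DemoOverlong;
  (* Hypothetical demo module, not part of BlackBox. *)

  (* Naive two-byte decoding without any validation. *)
  PROCEDURE Naive2* (b0, b1: SHORTCHAR): INTEGER;
  BEGIN
    RETURN (ORD(b0) - 192) * 64 + ORD(b1) - 128
  END Naive2;

  (* A two-byte sequence is well formed only if the lead byte is in 0C2X..0DFX
     and the second byte is a continuation byte in 80X..0BFX.
     Lead bytes 0C0X and 0C1X would only encode values below 128 (overlong forms). *)
  PROCEDURE Valid2* (b0, b1: SHORTCHAR): BOOLEAN;
  BEGIN
    RETURN (b0 >= 0C2X) & (b0 <= 0DFX) & (b1 >= 80X) & (b1 <= 0BFX)
  END Valid2;

END DemoOverlong.
```

With these definitions, Naive2(0C0X, 0AEX) returns 46 (2EH, the character "."), so an unchecked decoder would accept a hidden "." that a byte-level filter never saw, whereas Valid2(0C0X, 0AEX) returns FALSE.
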
Alexander ran the security tests a few years ago; they are here: http://forum.oberoncore.ru/viewtopic.ph ... =20#p75707

My simple test does not demonstrate security problems. It only checks that everything works well with identifiers.
This test works with both Kernel.Utf8ToString implementations.

Code:

StdCoder.Decode ..,, ..KN....3Qw7uP5PRPPNR9Rbf9b8R79FTvMf1GomCrlAy2xhX,Cb2x
 hXhC6FU1xhiZiVBhihgmRiioedhgrZcZRiXFfaqmSrtuGfa4700zdGrr8rmCLLCJuyKtYcZRiX
 7.2.s,sq9.0k,5TWyql.bnayKmKKqGomC5XzET1.PuP.MHT9N9ntumaU2,CJuyKtQC98P9PP7O
 NbXmb.2.AZ3k2kEK.,6.,U08J99SdfJHPNjvQCJuGKfaqmY6MwdONl1QCh0708T,U..w.Qt2U.
 sUGpmWbBxhYhAbndMHT9NY6Mw.sQq2Y6cwB.0.fi.w,gt0E.2.4Ed4.86.QC18RdfQHfMf9R9v
 Q7ONb17.,.H4,6.Y22.M.,.5uPffQHPNZ96RONjHA0z.U.2m,.D.,6uzzzzL4E.uzzzzzR06yz
 zzzr3UezzzzTb.Gzzzzz3,2UkzzzzTd.,6wE0E.WD.,cu.2Uy19.0DQ,czE3Uo1F.,6uk,U.U6
 qq06Ibe.8ssHomOrVyqqqqkuKmKKtCLLCJuo8.,Mw7ONh1.uGf.2UmT.6..E1U.M36uk.E.GDy
 zayIWKJaKIEyF01I0vH09H0LH01I0fH01H0fF0vH0HH01H0jn4ak4akYqIcyIdGJE4QU0GRqHE
 morSqRqk4akVyIbCJeqk2akU1o8OJQiH6TIE0mS0GOKHPin4ak2WR6HR61S6rR69R6fR6XR6HS
 6HR65R6XR6rnM8nRqk2gV72eGxd1hc2heGhcUAY2aa2ia2Sb24a2Cb2KZvgV7oe,JeUwc4at4C
 a44d4yp4ac4at4Cc4KbUAdCZe3xc3JevgV7Ic3xc7pdBAV7wc4at40.2YugbUAs.hWK,QZU2vE
 auE4wE0.AOQbBAV7AMRNGR9RFtFBAn..QZUYapoadQbBAV7UAphvgV7gcCZcUAY2a4.QbBAVBg
 cCZ6z6V,0...RN1Pc.z6V,8V...M67AB72U0CyIV1xhiZimxhgZhZJinpZHlVGLtmKWKqtCK.4
 Te..c95uPR9R.7ONbvM,kVkk.Um,..Unp3.6F6.ZD,6.636.M00.,..1cUXDJ9XGhighgmRiiQ
 88pum470,Mwd0UnpZGhighA70,cw5.0.L3D.53,6.C6.QiiQ8CJuaLqKKWKqt2Ul1.RVtZBE.8
 z12.0.E2EhU.E,,.RNEd1U2V.6,6..UYU.AU.U.UUQoO,,Mg5T.ytrM.M.6.,U0KyB.,..e,2.
 A.c8WFs5.2UEC.6..mEw7169rwKiEw3c0Cy2xBqqmU1xB..8,2..606.k22.0sfCbgAYX8Utj0
 0MyfU.Qfc7f77a.bQ0...
 --- end of encoding ---

Post by Josef Templ »

Ivan, your Utf8 parser is way too complicated.
It must be possible to implement it with only a few lines of
changes based on Helmut's version.
There is certainly no need to introduce an automaton (state machine)
because the checks can be inlined easily.

In general, there are no security issues involved.
It is a matter of correctness. If an illegal Utf8 sequence is parsed, that should be
recognized and reported with res = 2. out should be set to LONG(in) in that case.
Why? Because the input is most probably an ISO Latin-1 shortstring,
and that conversion is the only thing we can do in that case.
Please have a look at the updated System/Docu/Strings file.

- Josef

Post by Zinn »

Should the lines

ELSE
  val := ORD(ch) - 224;
  ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
  ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
  out[j] := CHR(val); INC(j)
END;

be changed to

ELSIF ch < 0F0X THEN
  val := ORD(ch) - 224;
  ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
  ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
  out[j] := CHR(val); INC(j)
ELSE
  out := LONG(in);
  res := 2;
  RETURN
END

Is that the solution?

Post by Josef Templ »

I think that we need some additional checks as outlined below.
This is untested code but it should show the idea.
Can it be simplified/sped up?
Please note that it assumes Valid(in), which is a usual precondition
for string or shortstring operations.

Code:

PROCEDURE Utf8ToString* (IN in : ARRAY OF SHORTCHAR; OUT out : ARRAY OF CHAR; 
                    OUT res: INTEGER);
  (* res = 0: ok; res = 1: out too short, result truncated; res = 2: format error *)
  VAR i, j, val, max : INTEGER; ch : SHORTCHAR;
  
  (* on illegal input, treat in as an ISO Latin-1 shortstring and copy it unchanged *)
  PROCEDURE FormatError();
  BEGIN out := in$; res := 2 (*format error*)
  END FormatError;
  
BEGIN
  ch := in[0]; i := 1; j := 0; max := LEN(out) - 1;
  WHILE (ch # 0X) & (j < max) DO
    IF ch < 80X THEN (* 1 byte: ASCII *)
      out[j] := ch; INC(j)
    ELSIF ch < 0E0X THEN (* 2 bytes *)
      val := ORD(ch) - 192;
      (* val < 0 means a continuation byte 80X..0BFX in lead position *)
      IF val < 0 THEN FormatError; RETURN END ;
      ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
      (* continuation bytes must be in 80X..0BFX; this also catches a premature 0X *)
      IF (ch < 80X) OR (ch >= 0C0X) THEN FormatError; RETURN END ;
      out[j] := CHR(val); INC(j)
    ELSIF ch < 0F0X THEN (* 3 bytes *)
      val := ORD(ch) - 224;
      IF val < 0 THEN FormatError; RETURN END ;
      ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
      IF (ch < 80X) OR (ch >= 0C0X) THEN FormatError; RETURN END ;
      ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
      IF (ch < 80X) OR (ch >= 0C0X) THEN FormatError; RETURN END ;
      out[j] := CHR(val); INC(j)
    ELSE (* 4-byte sequences do not fit into a 16-bit CHAR *)
      FormatError; RETURN
    END ;
    ch := in[i]; INC(i)
  END;
  out[j] := 0X;
  IF ch = 0X THEN res := 0 (*ok*) ELSE res := 1 (*truncated*) END
END Utf8ToString;

Post by Ivan Denisov »

Josef, your code will not detect this incorrect sequence: 0EDX followed by a byte in 0A0X-0BFX
http://www.ietf.org/rfc/rfc3629.txt (4. Syntax of UTF-8 Byte Sequences)
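
A minimal sketch of why this sequence matters, assuming the UTF8-3 rule from section 4 of RFC 3629. The names (DemoSurrogate, Naive3, SecondByteOk) are hypothetical and are not taken from Alexander's or Josef's code. SecondByteOk assumes that b0 is already known to be a three-byte lead byte in 0E0X..0EFX.

```component-pascal
MODULE DemoSurrogate;
  (* Hypothetical demo module, not part of BlackBox. *)

  (* Naive three-byte decoding without any validation. *)
  PROCEDURE Naive3* (b0, b1, b2: SHORTCHAR): INTEGER;
  BEGIN
    RETURN ((ORD(b0) - 224) * 64 + ORD(b1) - 128) * 64 + ORD(b2) - 128
  END Naive3;

  (* Range check for the second byte of a three-byte sequence.
     Precondition (assumed): b0 is in 0E0X..0EFX. *)
  PROCEDURE SecondByteOk* (b0, b1: SHORTCHAR): BOOLEAN;
  BEGIN
    IF b0 = 0E0X THEN
      RETURN (b1 >= 0A0X) & (b1 <= 0BFX)  (* excludes overlong forms *)
    ELSIF b0 = 0EDX THEN
      RETURN (b1 >= 80X) & (b1 <= 9FX)  (* excludes surrogates 0D800H..0DFFFH *)
    ELSE
      RETURN (b1 >= 80X) & (b1 <= 0BFX)
    END
  END SecondByteOk;

END DemoSurrogate.
```

Naive3(0EDX, 0A0X, 80X) returns 55296 (0D800H), the first surrogate code point, which RFC 3629 does not allow in UTF-8. SecondByteOk(0EDX, 0A0X) returns FALSE, so a decoder that uses such a check rejects the sequence.
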

Alexander updated this according to Josef's documentation changes. He also fixed one error in his code concerning an incorrect character range.
He said (on Skype) that an automaton (state machine) is the simplest solution here.

You can try this here:
blackbox-1.7-a1.024.zip

The diff with Alexander's previous version

The diff with original

Post by Josef Templ »

Ivan, please test more carefully.
Maybe you forgot to link the modified Kernel to the .exe file.
My tests show that your example is correctly reported as an illegal Utf8 format.

- Josef