Utf8ToString converter for Issue #19

Issue #19: Unicode for Component Pascal identifiers - luowy or Josef's solution

ABSTAIN: 5 votes (45%)
luowy's solution: 2 votes (18%)
Josef's solution: 4 votes (36%)

Total votes: 11

DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Utf8ToString converter for Issue #19

Post by DGDanforth »

Ivan wrote: "Luowy's latest version seems OK and passes the tests. So we can choose between Josef's version and Luowy's version, which is based on Josef's."
I am creating this poll simply trusting and hoping that you all know what you are doing.
Without spending a great deal of time on it, I cannot verify the correctness of the code.

So let's do this; if needed, we can always modify it in the future.
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Ivan Denisov »

"Unicode for Component Pascal identifiers" is mainly Helmut's solution.

We are just choosing between two implementations of the Utf8ToString converter in Kernel.

The version by Josef is based on Helmut's version:

PROCEDURE Utf8ToString* (IN in: ARRAY OF SHORTCHAR; OUT out: ARRAY OF CHAR; OUT res: INTEGER);
  VAR i, j, val, max: INTEGER; ch: SHORTCHAR;

  PROCEDURE FormatError();
  BEGIN out := in$; res := 2 (*format error*)
  END FormatError;

BEGIN
  ch := in[0]; i := 1; j := 0; max := LEN(out) - 1;
  WHILE (ch # 0X) & (j < max) DO
    IF ch < 80X THEN (* 1 byte *)
      out[j] := ch; INC(j)
    ELSIF ch < 0E0X THEN (* 2 bytes *)
      val := ORD(ch) - 192;
      IF val < 0 THEN FormatError; RETURN END;
      ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
      IF (ch < 80X) OR (ch >= 0E0X) THEN FormatError; RETURN END;
      out[j] := CHR(val); INC(j)
    ELSIF ch < 0F0X THEN (* 3 bytes *)
      val := ORD(ch) - 224;
      IF val < 0 THEN FormatError; RETURN END;
      ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
      IF (ch < 80X) OR (ch >= 0E0X) THEN FormatError; RETURN END;
      ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
      IF (ch < 80X) OR (ch >= 0E0X) THEN FormatError; RETURN END;
      out[j] := CHR(val); INC(j)
    ELSE (* 4 bytes: not supported *)
      FormatError; RETURN
    END;
    ch := in[i]; INC(i)
  END;
  out[j] := 0X;
  IF ch = 0X THEN res := 0 (*ok*) ELSE res := 1 (*truncated*) END
END Utf8ToString;

The version by Luowy is based on Josef's version.

PROCEDURE Utf8ToString* (IN in: ARRAY OF SHORTCHAR; OUT out: ARRAY OF CHAR; OUT res: INTEGER);
  VAR i, j, val, max: INTEGER; ch, ch0: SHORTCHAR;
BEGIN
  ch := in[0]; i := 1; j := 0; max := LEN(out) - 1;
  WHILE (ch # 0X) & (j < max) DO
    IF ch < 80X THEN (* 1 byte: 00-7F *)
      out[j] := ch; INC(j)
    ELSIF ch < 0E0X THEN (* 2 bytes: C2-DF UTF8Tail *)
      val := ORD(ch) - 192;
      IF val < 2 (*0*) THEN out := ""; res := 2; RETURN END;
      ch := in[i]; INC(i);
      IF (ch < 80X) OR (ch >= 0E0X) THEN out := ""; res := 2; RETURN END;
      val := val * 64 + ORD(ch) - 128;
      out[j] := CHR(val); INC(j)
    ELSIF ch < 0F0X THEN (* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
      val := ORD(ch) - 224; ch0 := ch; ch := in[i]; INC(i);
      IF (ch0 = 0E0X) & (ch >= 0A0X) & (ch <= 0BFX)
        OR (ch0 = 0EDX) & (ch >= 80X) & (ch <= 9FX)
        OR (ch0 # 0E0X) & (ch0 # 0EDX) & (ch >= 80X) & (ch <= 0BFX)
      THEN val := val * 64 + ORD(ch) - 128
      ELSE out := ""; res := 2; RETURN
      END;
      ch := in[i]; INC(i);
      IF (ch < 80X) OR (ch >= 0E0X) THEN out := ""; res := 2; RETURN END;
      val := val * 64 + ORD(ch) - 128;
      out[j] := CHR(val); INC(j)
    ELSE (* 4 bytes: not supported *)
      out := ""; res := 2; RETURN
    END;
    ch := in[i]; INC(i)
  END;
  out[j] := 0X;
  IF ch = 0X THEN res := 0 (*ok*) ELSE res := 1 (*truncated*) END
END Utf8ToString;

The main difference is that Luowy's version checks the format of the input according to the Unicode 7.0 standard and returns res = 2 if the input contains ill-formed UTF-8. Luowy's version also runs a bit faster and does not return a copy of the input (out := in$) when the conversion fails.
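
For anyone who wants to compare the two procedures without a BlackBox installation, here is a rough Python model of both (Python, not Component Pascal, so that it runs anywhere). It is only a sketch of the logic posted above, not the Kernel code. The names utf8_to_string, out_len and strict are made up for this illustration. It assumes that out := in$ truncates to the capacity of out, and Python's chr is not limited to 16 bits the way CHR is.

```python
OK, TRUNCATED, FORMAT_ERROR = 0, 1, 2  # the res codes used by both versions


def utf8_to_string(data: bytes, out_len: int, strict: bool):
    """Model of Utf8ToString; returns (out, res).

    data must not contain 0 bytes; out_len plays the role of LEN(out).
    strict=False follows Josef's checks, strict=True follows Luowy's.
    """
    src = data + b"\x00"      # Component Pascal strings are 0X-terminated
    max_chars = out_len - 1   # one slot is reserved for the terminator

    def fail():
        if strict:
            return "", FORMAT_ERROR  # Luowy: out := ""
        # Josef: out := in$ (ASSUMPTION: the copy is truncated to fit out)
        return data.decode("latin-1")[:max_chars], FORMAT_ERROR

    def loose_tail(b):
        # Tail test shared by both versions: rejects b < 80X and b >= 0E0X
        return 0x80 <= b < 0xE0

    out = []
    i = 0
    while src[i] != 0 and len(out) < max_chars:
        ch = src[i]
        i += 1
        if ch < 0x80:                       # 1 byte
            out.append(chr(ch))
        elif ch < 0xE0:                     # 2 bytes
            val = ch - 0xC0
            if val < (2 if strict else 0):  # Luowy also rejects C0 and C1
                return fail()
            t = src[i]
            i += 1
            if not loose_tail(t):
                return fail()
            out.append(chr(val * 64 + t - 0x80))
        elif ch < 0xF0:                     # 3 bytes
            val = ch - 0xE0
            t1 = src[i]
            i += 1
            if not strict:
                ok = loose_tail(t1)
            elif ch == 0xE0:
                ok = 0xA0 <= t1 <= 0xBF     # no overlong forms
            elif ch == 0xED:
                ok = 0x80 <= t1 <= 0x9F     # no surrogates
            else:
                ok = 0x80 <= t1 <= 0xBF
            if not ok:
                return fail()
            t2 = src[i]
            i += 1
            if not loose_tail(t2):
                return fail()
            out.append(chr((val * 64 + t1 - 0x80) * 64 + t2 - 0x80))
        else:                               # 4 bytes: not supported
            return fail()
    return "".join(out), (OK if src[i] == 0 else TRUNCATED)
```

With this model, utf8_to_string(b"\xed\xa0\x80", 8, strict=False) gives ("\ud800", 0), while the same call with strict=True gives ("", 2). A short buffer such as utf8_to_string(b"abc", 3, strict=True) gives ("ab", 1).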

Neither version supports 4-byte UTF-8 sequences, which are used for musical symbols, rare Chinese characters and extinct forms of writing; the range 00110000 - 001FFFFF is not used by Unicode.

One comment about "bad format" 3-byte UTF-8:

- The 1st illegal sequence is 0E0X + 080X-09FX + 080X-0BFX.
This sequence can never be the result of the StringToUtf8 converter. If it appears in the input, there are security risks.

- The 2nd illegal sequence is 0EDX + 0A0X-0BFX + 080X-0BFX.
This sequence encodes "private" (UCS-2) or surrogate (UTF-16) chars U+D800…U+DFFF, which should not be converted to String.
Actually we do not know which Unicode encoding is assumed in BlackBox, but it should be UCS-2 or UTF-16.
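
To see where the two ranges come from, one can run the plain 3-byte arithmetic on their end points. The following Python lines are only a small illustration (decode3 is a made-up helper, not part of Kernel). The first range decodes to values below U+0800, which already fit into one or two bytes, so the 3-byte form is an overlong encoding. The second range decodes to the surrogate block.

```python
def decode3(b0: int, b1: int, b2: int) -> int:
    """Raw 3-byte UTF-8 arithmetic, with no validity checks at all."""
    return ((b0 - 0xE0) * 64 + (b1 - 0x80)) * 64 + (b2 - 0x80)


# Case 1: 0E0X + 080X-09FX + 080X-0BFX (lowest and highest sequence)
overlong = (decode3(0xE0, 0x80, 0x80), decode3(0xE0, 0x9F, 0xBF))

# Case 2: 0EDX + 0A0X-0BFX + 080X-0BFX (lowest and highest sequence)
surrogates = (decode3(0xED, 0xA0, 0x80), decode3(0xED, 0xBF, 0xBF))

print([hex(v) for v in overlong])    # ['0x0', '0x7ff']
print([hex(v) for v in surrogates])  # ['0xd800', '0xdfff']
```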
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Josef Templ »

For those that are not yet sure which option to choose, here is a summary of the discussion:

BlackBox/CP supports the type CHAR as a 2-byte Unicode character but has no concept of
'illegal' Unicode characters. Every 2-byte value can be assigned to a CHAR.
As far as I know, this has never been a problem for anybody.
If we introduce checks for illegal characters in Utf8ToString we get the following problems:

- we introduce the concept of illegal Unicode characters in one isolated context:
if illegal Unicode characters are really a problem to anybody, this issue must be solved deeper in
the language and system, NOT ONLY in Utf8ToString.
- asymmetric behavior between StringToUtf8 and Utf8ToString:
what can be encoded, cannot be decoded
- some runtime overhead in the conversion:
luowy's version is not faster, but slower.
- a more complex conversion procedure
- dependency on the 'current' Unicode standard:
changed several times in the past
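
The asymmetry point can be checked mechanically. The Python sketch below assumes that StringToUtf8 encodes every 16-bit CHAR value by plain bit arithmetic without any validity check (encode_char is a stand-in written for this illustration, not the Kernel code). It then applies the second-byte rule of the checking decoder and lists which CHAR values would not survive a round trip.

```python
def encode_char(code: int) -> bytes:
    """ASSUMED encoder: UTF-8 bit arithmetic for one CHAR, no validity check."""
    if code < 0x80:
        return bytes([code])
    if code < 0x800:
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])


def strict_accepts(seq: bytes) -> bool:
    """Second-byte rule of the checking decoder for 3-byte sequences."""
    if len(seq) != 3:
        return True  # only the 3-byte rule is modelled here
    lead, second = seq[0], seq[1]
    if lead == 0xE0:
        return 0xA0 <= second <= 0xBF
    if lead == 0xED:
        return 0x80 <= second <= 0x9F
    return 0x80 <= second <= 0xBF


# Every CHAR value whose encoding the checking decoder would refuse
rejected = [c for c in range(0x10000) if not strict_accepts(encode_char(c))]
print(hex(rejected[0]), hex(rejected[-1]), len(rejected))  # 0xd800 0xdfff 2048
```

Under that assumption, exactly the 2048 values 0D800X to 0DFFFX can be encoded but not decoded again; all other CHAR values round-trip.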

- Josef
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Josef Templ wrote: - some runtime overhead in the conversion:
luowy's version is not faster, but slower.
That is not true. Please take a look here. Luowy's version, which is based on yours, is faster. Please download my test and try it yourself.
luowy
Posts: 234
Joined: Mon Oct 20, 2014 12:52 pm

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by luowy »

I have a question about Josef's version:

  ELSIF ch < 0F0X THEN
       val := ORD(ch) - 224;
       IF val < 0 THEN FormatError; RETURN END ; <<<<
       ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
Is the statement >>>IF val < 0 THEN FormatError; RETURN END ;<<< necessary?
At this position ch must be in the range [E0..EF], so ORD(ch) - 224 (*0E0H*) must be >= 0.
I think there is no need for this check.
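
The range argument is easy to confirm by brute force. These few Python lines are just a check of the arithmetic, not BlackBox code. They list the values of val that each branch can see, given the branch conditions ch < 0E0X and ch < 0F0X.

```python
# 2-byte branch: reached for lead bytes 80..DF, val := ORD(ch) - 192
two_byte = [lead - 192 for lead in range(0x80, 0xE0)]

# 3-byte branch: reached for lead bytes E0..EF, val := ORD(ch) - 224
three_byte = [lead - 224 for lead in range(0xE0, 0xF0)]

print(min(two_byte), max(two_byte))      # -64 31
print(min(three_byte), max(three_byte))  # 0 15
```

So val < 0 can only happen in the 2-byte branch (lead bytes 80..BF), never in the 3-byte branch.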


luowy
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Josef Templ »

> the statement >>>IF val < 0 THEN FormatError; RETURN END ;<<<is necessary?

Right, this check can be removed safely.

- Josef
luowy
Posts: 234
Joined: Mon Oct 20, 2014 12:52 pm

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by luowy »

Is the code out := in$; necessary?
Why not use out := ''; instead?
If LEN(out) is too short, it will cause a trap in the Kernel module.


luowy
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Josef Templ »

> the code out := in$; is necessary?
> why not do: out:='';

Error handling is actually a separate topic.
The same approach can be applied whatever kind of checks are performed.

At the beginning there was only the truncation error.
out contained the fitting left part of the encoded/decoded in.
In most cases truncation must be treated as an error, in some, e.g. when logging
for debugging purposes it may be appropriate to ignore it.
Now with the format error there is the same situation.
In most cases this must be treated as an error anyway and the contents of out are irrelevant;
in some cases, e.g. when logging for debugging purposes, it may be appropriate to ignore it.
Then out should also contain something. The simplest thing to do is to copy in.
When you look at a possible source for a format error, viz. to decode a string that has not
been encoded before, e.g. because it comes from BB1.6, this also gives a meaningful result.

> if the LEN(out) is short,it will get a trap in the kernel module.

There is no TRAP on truncation because a string ($) assignment is used for copying in to out.
It should be clear to anybody that there is no need to optimize the error handling in terms of execution speed.
Errors are exceptional situations that normally terminate the running command anyway.

- Josef
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Ivan Denisov »

http://forum.blackboxframework.org/whod ... php?id=154

Did not vote
OberonCore
ReneK
akastargazer
warnersoft
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by DGDanforth »

Ivan Denisov wrote: http://forum.blackboxframework.org/whod ... php?id=154

Did not vote
OberonCore
ReneK
akastargazer
warnersoft
English translation: Those who have not voted yet.