Utf8ToString converter for Issue #19

Posted: Wed Nov 26, 2014 5:54 am
by DGDanforth
Ivan wrote: "Luowy's latest version seems OK and passes the tests. So we can choose between Josef's version and Luowy's version, which is based on Josef's."
I am creating this poll simply trusting and hoping that you all know what you are doing.
Without spending a great deal of time on it, I cannot verify the correctness of the code.

So let's do this; if needed, we can always modify it in the future.

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 6:33 am
by Ivan Denisov
"Unicode for Component Pascal identifiers" is mainly Helmut's solution.

We are just choosing between two implementations of the Utf8ToString converter in Kernel.

The version by Josef is based on Helmut's version:

Code:

PROCEDURE Utf8ToString* (IN in : ARRAY OF SHORTCHAR; OUT out : ARRAY OF CHAR; OUT res: INTEGER);
  VAR i, j, val, max : INTEGER; ch : SHORTCHAR;
  
  PROCEDURE FormatError();
  BEGIN out := in$; res := 2 (*format error*)
  END FormatError;
  
BEGIN
   ch := in[0]; i := 1; j := 0; max := LEN(out) - 1;
   WHILE (ch # 0X) & (j < max) DO
     IF ch < 80X THEN
       out[j] := ch; INC(j)
     ELSIF ch < 0E0X THEN
       val := ORD(ch) - 192;
       IF val < 0 THEN FormatError; RETURN END ;
       ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
       IF (ch < 80X) OR (ch >= 0E0X) THEN FormatError; RETURN END ;
       out[j] := CHR(val); INC(j)
     ELSIF ch < 0F0X THEN 
       val := ORD(ch) - 224;
       IF val < 0 THEN FormatError; RETURN END ;
       ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
       IF (ch < 80X) OR (ch >= 0E0X) THEN FormatError; RETURN END ;
       ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
       IF (ch < 80X) OR (ch >= 0E0X) THEN FormatError; RETURN END ;
       out[j] := CHR(val); INC(j)
     ELSE
       FormatError; RETURN
     END ;
     ch := in[i]; INC(i)
   END;
   out[j] := 0X;
   IF ch = 0X THEN res := 0 (*ok*) ELSE res := 1 (*truncated*) END
END Utf8ToString;
The version by Luowy is based on Josef's version.

Code:

  PROCEDURE Utf8ToString* (IN in: ARRAY OF SHORTCHAR; OUT out: ARRAY OF CHAR; OUT res: INTEGER);
      VAR i, j, val, max: INTEGER; ch, ch0: SHORTCHAR;
   BEGIN
      ch := in[0]; i := 1; j := 0; max := LEN(out) - 1;
      WHILE (ch # 0X) & (j < max) DO
         IF ch < 80X THEN           (*1 byte   00-7F *)
            out[j] := ch; INC(j)
         ELSIF ch < 0E0X THEN  (* 2 bytes  C2-DF UTF8Tail *)
            val := ORD(ch) - 192; IF val < 2 (*0*) THEN out := ""; res := 2; RETURN END;
            ch := in[i]; INC(i); IF (ch < 80X) OR (ch >= 0E0X) THEN out := ""; res := 2; RETURN END;
            val := val * 64 + ORD(ch) - 128;
            out[j] := CHR(val); INC(j)
         ELSIF ch < 0F0X THEN  (* 3 bytes 1110xxxx 10xxxxxx 10xxxxxx *)
            val := ORD(ch) - 224; ch0 := ch; ch := in[i]; INC(i);
            IF (ch0 = 0E0X)&(ch >= 0A0X)&(ch <= 0BFX) OR (ch0 = 0EDX)& (ch >= 80X)&(ch <= 9FX)
               OR (ch0#0E0X)&(ch0#0EDX)&(ch >= 80X)&(ch <= 0BFX) THEN val := val * 64 + ORD(ch) - 128
            ELSE out := ""; res := 2; RETURN
            END;
            ch := in[i]; INC(i); IF (ch < 80X) OR (ch >= 0E0X) THEN out := ""; res := 2; RETURN END;
            val := val * 64 + ORD(ch) - 128;
            out[j] := CHR(val); INC(j)
         ELSE (* 4 bytes *)
            out := ""; res := 2; RETURN
         END;
         END;
         ch := in[i]; INC(i)
      END;
      out[j] := 0X;
      IF ch = 0X THEN res := 0 (*ok*) ELSE res := 1 (*truncated*) END
   END Utf8ToString;
The main difference is that Luowy's version checks the format of the input according to the Unicode 7.0 standard and returns res = 2 if it finds that the input contains ill-formed UTF-8. Luowy's version also works a bit faster and does not return a copy of the input string (out := in$) when the conversion fails.

Neither version supports 4-byte UTF-8 sequences, which are used for musical symbols, rare Chinese characters and extinct forms of writing; the range 00110000 - 001FFFFF is not used by Unicode.
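As a rough illustration of the 4-byte limitation, here is a minimal Python sketch (not Component Pascal; the function name sequence_length is hypothetical) that mirrors the lead-byte tests of the IF/ELSIF chain in both procedures. It only classifies the first byte and does not validate anything else.

```python
def sequence_length(lead: int) -> int:
    """Sequence length implied by a lead byte, following the branch order
    of the posted procedures. 0 means the byte falls into the final ELSE."""
    if lead < 0x80:
        return 1   # 00..7F: single byte
    elif lead < 0xE0:
        return 2   # 80..DF reach this branch; further checks reject some of them
    elif lead < 0xF0:
        return 3   # E0..EF
    else:
        return 0   # F0 and above: 4-byte forms, reported as res = 2

# U+1D11E MUSICAL SYMBOL G CLEF needs four bytes in UTF-8.
g_clef = "\U0001D11E".encode("utf-8")
print(g_clef.hex(" "))             # f0 9d 84 9e
print(sequence_length(g_clef[0]))  # 0
```

So a string containing such a character is reported as a format error by both versions rather than being converted.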

One comment about the "bad format" 3-byte UTF-8 sequences:

- The 1st illegal sequence is 0E0X + 080X-09FX + 080X-0BFX.
This is an overlong encoding of a value below U+0800, so it can never be the result of the StringToUtf8 converter. If it appears in the input, there are security risks.

- The 2nd illegal sequence is 0EDX + 0A0X-0BFX + 080X-0BFX.
This sequence encodes the "private" (UCS-2) or surrogate (UTF-16) chars U+D800…U+DFFF, which should not be converted to String.
Actually, we do not know which Unicode encoding is assumed in BlackBox, but it should be UCS-2 or UTF-16.
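The two ranges can be checked with a few lines of Python (a sketch only, not Component Pascal; naive_decode3 and strict_ok are hypothetical helper names). The first helper repeats the val arithmetic of the 3-byte branch without any checks, the second asks Python's built-in strict UTF-8 decoder whether it accepts a byte sequence.

```python
def naive_decode3(b0: int, b1: int, b2: int) -> int:
    """Combine three bytes like the val arithmetic of the 3-byte branch,
    with no well-formedness checks at all."""
    return ((b0 - 224) * 64 + (b1 - 128)) * 64 + (b2 - 128)

def strict_ok(seq: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder accepts seq."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 1st range: E0 80..9F xx yields values below U+0800 (overlong forms).
print(hex(naive_decode3(0xE0, 0x80, 0x80)))  # 0x0
print(hex(naive_decode3(0xE0, 0x9F, 0xBF)))  # 0x7ff

# 2nd range: ED A0..BF xx yields U+D800..U+DFFF.
print(hex(naive_decode3(0xED, 0xA0, 0x80)))  # 0xd800
print(hex(naive_decode3(0xED, 0xBF, 0xBF)))  # 0xdfff

print(strict_ok(bytes([0xE0, 0x80, 0x80])))  # False
print(strict_ok(bytes([0xED, 0xA0, 0x80])))  # False
print(strict_ok(bytes([0xE0, 0xA0, 0x80])))  # True (U+0800, shortest form)
```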

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 7:59 am
by Josef Templ
For those that are not yet sure which option to choose, here is a summary of the discussion:

BlackBox/CP supports the type CHAR as 2-byte Unicode character but has no concept of
'illegal' Unicode characters. Every 2-byte value can be assigned to a CHAR.
This was never a problem to anybody, as far as I know.
If we introduce checks for illegal characters in Utf8ToString we get the following problems:

- we introduce the concept of illegal Unicode characters in one isolated context:
if illegal Unicode characters are really a problem to anybody, this issue must be solved deeper in
the language and system, NOT ONLY in Utf8ToString.
- asymmetric behavior between StringToUtf8 and Utf8ToString:
what can be encoded, cannot be decoded
- some runtime overhead in the conversion:
luowy version is not faster, but slower.
- a more complex conversion procedure
- dependency on the 'current' Unicode standard:
changed several times in the past
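The asymmetry point can be illustrated with a small Python sketch (not Component Pascal). StringToUtf8 is not shown in this thread, so encode_char16 below is a hypothetical stand-in that assumes the encoder accepts every 16-bit CHAR value; Python's built-in decoder plays the role of a checking Utf8ToString.

```python
def encode_char16(code: int) -> bytes:
    """Encode one 16-bit CHAR value as UTF-8 with no validity checks
    (assumed behaviour, StringToUtf8 itself is not shown in the thread)."""
    if code < 0x80:
        return bytes([code])
    if code < 0x800:
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])

def round_trips(code: int) -> bool:
    """True if a strict decoder accepts the encoding and returns code."""
    try:
        return ord(encode_char16(code).decode("utf-8")) == code
    except UnicodeDecodeError:
        return False

print(round_trips(0x20AC))  # True: EURO SIGN, E2 82 AC
print(round_trips(0xD800))  # False: ED A0 80 is rejected as a surrogate
```

Under these assumptions a CHAR such as 0D800X can be encoded but not decoded again by a checking decoder, which is the asymmetry described in the list above.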

- Josef

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 8:05 am
by Ivan Denisov
Josef Templ wrote: - some runtime overhead in the conversion:
luowy version is not faster, but slower.
That is not true. Please take a look here. Luowy's version, which is based on yours, is faster. Please download my test and try it yourself.

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 8:35 am
by luowy
I have a question about Josef's version:

Code:

  ELSIF ch < 0F0X THEN
       val := ORD(ch) - 224;
       IF val < 0 THEN FormatError; RETURN END ; <<<<
       ch := in[i]; INC(i); val := val * 64 + ORD(ch) - 128;
Is the statement >>>IF val < 0 THEN FormatError; RETURN END ;<<< necessary?
At this position ch must be in the range [E0..EF], so ORD(ch) - 224 (*0E0H*) must be >= 0.
There is no need for this check, I think.
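The argument can be confirmed by brute force. The following Python sketch (not Component Pascal; the variable names are hypothetical) enumerates every byte that can reach the 2-byte and 3-byte branches and looks for negative values of val.

```python
# Bytes that can reach each branch of the IF/ELSIF chain.
two_byte_branch = range(0x80, 0xE0)    # ch >= 80X and ch < 0E0X
three_byte_branch = range(0xE0, 0xF0)  # ch >= 0E0X and ch < 0F0X

negative_in_2 = [b for b in two_byte_branch if b - 192 < 0]
negative_in_3 = [b for b in three_byte_branch if b - 224 < 0]

print(f"{negative_in_2[0]:02X}..{negative_in_2[-1]:02X}")  # 80..BF
print(negative_in_3)                                      # []
```

The check is meaningful in the 2-byte branch, where it catches a continuation byte used as a lead byte, but it can never fire in the 3-byte branch.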


luowy

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 8:57 am
by Josef Templ
> the statement >>>IF val < 0 THEN FormatError; RETURN END ;<<<is necessary?

Right, this check can be removed safely.

- Josef

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 9:21 am
by luowy
Is the code out := in$; necessary?
Why not use out := ''; instead?
If LEN(out) is too short, it will cause a trap in the Kernel module.


luowy

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Thu Nov 27, 2014 9:05 am
by Josef Templ
> the code out := in$; is necessary?
> why not do: out:='';

Error handling is actually a separate topic.
The same approach can be applied whatever kind of checks are performed.

At the beginning there was only the truncation error.
out contained the fitting left part of the encoded/decoded in.
In most cases truncation must be treated as an error, in some, e.g. when logging
for debugging purposes it may be appropriate to ignore it.
Now with the format error there is the same situation.
In most cases this must be treated as an error anyway and the contents of out are irrelevant;
in some cases, e.g. when logging for debugging purposes, it may be appropriate to ignore it.
Then out should also contain something. The simplest thing to do is to copy in.
When you look at a possible source for a format error, viz. to decode a string that has not
been encoded before, e.g. because it comes from BB1.6, this also gives a meaningful result.

> if the LEN(out) is short,it will get a trap in the kernel module.

There is no TRAP on truncation because a string ($) assignment is used for copying in to out.
It should be clear to anybody that there is no need to optimize the error handling in terms of execution speed.
Errors are exceptional situations that normally terminate the running command anyway.
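To make the three outcomes concrete, here is a Python port sketch of the version with the copy fallback (not Component Pascal, and not a definitive implementation). Assumptions: capacity stands in for LEN(out), the fallback copy is truncated silently as described above, bytes are read as Latin-1 for that copy, and the continuation-byte test is reproduced as posted. The names utf8_to_string, OK, TRUNCATED and FORMAT_ERROR are hypothetical.

```python
OK, TRUNCATED, FORMAT_ERROR = 0, 1, 2

def utf8_to_string(data: bytes, capacity: int):
    """Return (out, res). capacity plays the role of LEN(out) and
    includes room for the 0X terminator."""
    src = data + b"\x00"                 # emulate the 0X terminator of `in`
    max_len = capacity - 1
    # Assumed behaviour of out := in$ on a format error: a truncated copy.
    fallback = data.split(b"\x00")[0].decode("latin-1")[:max_len]
    out = []
    ch = src[0]
    i = 1
    while ch != 0 and len(out) < max_len:
        if ch < 0x80:
            out.append(chr(ch))
        elif ch < 0xE0:
            val = ch - 192
            if val < 0:
                return fallback, FORMAT_ERROR
            ch = src[i]
            i += 1
            if ch < 0x80 or ch >= 0xE0:
                return fallback, FORMAT_ERROR
            out.append(chr(val * 64 + ch - 128))
        elif ch < 0xF0:
            val = ch - 224
            for _ in range(2):           # two continuation bytes
                ch = src[i]
                i += 1
                if ch < 0x80 or ch >= 0xE0:
                    return fallback, FORMAT_ERROR
                val = val * 64 + ch - 128
            out.append(chr(val))
        else:
            return fallback, FORMAT_ERROR
        ch = src[i]
        i += 1
    return "".join(out), (OK if ch == 0 else TRUNCATED)

text = "h\u00e9llo".encode("utf-8")
print(utf8_to_string(text, 16)[1])        # 0: fits
print(utf8_to_string(text, 3)[1])         # 1: only the first two characters fit
print(utf8_to_string(b"\x80abc", 16)[1])  # 2: stray continuation byte
```

A caller that only logs for debugging can ignore res and print out in all three cases, while any other caller would normally treat res # 0 as an error.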

- Josef

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Wed Dec 03, 2014 5:10 am
by Ivan Denisov
http://forum.blackboxframework.org/whod ... php?id=154

Did not vote
OberonCore
ReneK
akastargazer
warnersoft

Re: Issue #19 : Unicode for Component Pascal identifiers

Posted: Wed Dec 03, 2014 5:13 am
by DGDanforth
Ivan Denisov wrote: http://forum.blackboxframework.org/whod ... php?id=154

Did not vote
OberonCore
ReneK
akastargazer
warnersoft
English translation: Those who have not voted yet.