issue-#19: Unicode for Component Pascal identifiers

Merged to the master branch
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Josef Templ wrote:
> No, Josef, I think you are making a mistake here. All of Unicode is covered by well-formed UTF-8, so there is no violation of the Component Pascal string definition. If forbidden sequences appear, it is either a hacker's work or a damaged file. I wrote the Test5 procedure to demonstrate this.

Ivan, there is not only Utf8ToString but also StringToUtf8.
They should be symmetric in their behavior, i.e. a string encoded by
StringToUtf8 should be decoded by Utf8ToString.

- Josef
Yes! StringToUtf8 cannot produce forbidden sequences! It always uses the shortest representation.

For example, for the letter a, StringToUtf8 returns 97 and never the forbidden 224 129 161.

But your version of Utf8ToString returns a for 224 129 161. By the symmetry you have just proposed, this should not happen.

I will check this now with a strict test. Maybe there are some exceptions.
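
The overlong case above can be checked with a short Python sketch (Python is used here only for illustration; `encode_utf8` and `lenient_decode3` are hypothetical helpers, not the BlackBox procedures). It shows that a shortest-form encoder yields 97 for a, that a decoder which only looks at the bit layout also yields 97 for 224 129 161, and that a strict decoder rejects those bytes.

```python
def encode_utf8(cp: int) -> bytes:
    """Shortest-form UTF-8 for one BMP code point (surrogates not handled)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    return bytes([0xE0 | cp >> 12, 0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])


def lenient_decode3(b: bytes) -> int:
    """Decode a 3-byte sequence by bit layout only, with no range checks."""
    return (b[0] & 0x0F) << 12 | (b[1] & 0x3F) << 6 | (b[2] & 0x3F)


overlong_a = bytes([224, 129, 161])      # overlong form of 'a'
shortest = encode_utf8(ord("a"))         # the encoder never emits the overlong form
lenient = lenient_decode3(overlong_a)    # a format-only decoder still yields 97

try:
    overlong_a.decode("utf-8")           # Python's built-in decoder is strict
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(list(shortest), lenient, strict_ok)
```

So a lenient decoder accepts input that the matching encoder can never produce, which is the asymmetry discussed above.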

Post by Ivan Denisov »

Ivan Denisov wrote:Yes! StringToUtf8 cannot produce forbidden sequences!
I was wrong :(
For chars 55296 to 57343, StringToUtf8 generates ill-formed UTF-8: 0EDX 0A0X-0BFX 080X-0BFX

Post by Ivan Denisov »

Ivan Denisov wrote:
Ivan Denisov wrote:Yes! StringToUtf8 cannot produce forbidden sequences!
I was wrong :(
For chars 55296 to 57343, StringToUtf8 generates ill-formed UTF-8: 0EDX 0A0X-0BFX 080X-0BFX
I found the answer here: https://en.wikipedia.org/wiki/UTF-16
The region from 55296 to 57343 is U+D800 to U+DFFF.
"Within this plane, code points U+D800 to U+DFFF (see below) were never assigned character values in UCS-2, and in UTF-16 are reserved for high and low surrogates used to encode codepoint values greater than U+FFFF."
So in UCS-2 this region was left unassigned, and in UTF-16 it is used for the surrogates that encode values greater than U+FFFF.
We should not transform them to UTF-8; StringToUtf8 must be fixed!
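
A small Python sketch of the point above, under the assumption that the encoder simply applies the 3-byte bit layout to every value from 0x800 to 0xFFFF (`naive_encode3` is a hypothetical stand-in, not the actual StringToUtf8). It reproduces the 0EDX 0A0X-0BFX 080X-0BFX range and shows that a strict UTF-8 implementation refuses surrogates in both directions.

```python
def naive_encode3(cp: int) -> bytes:
    """Apply the 3-byte UTF-8 bit layout without excluding surrogates."""
    return bytes([0xE0 | cp >> 12, 0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])


first = naive_encode3(0xD800)   # lowest surrogate, 55296
last = naive_encode3(0xDFFF)    # highest surrogate, 57343


def strict_decodes(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def strict_encodes(cp: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts the code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(first.hex(" "), last.hex(" "))
print(strict_decodes(first), strict_encodes(0xD800))
```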
luowy
Posts: 234
Joined: Mon Oct 20, 2014 12:52 pm

Post by luowy »

This is my multi-variant UTF-8 module. I don't mind whether any procedure of this module is accepted into the kernel;
I hope you can check it quickly. In any case, it accepts 4-byte UTF-8, which may be useful for other things.

This is a draft; I rewrote it for this post just now.

We have spent too much time on this issue. I hope we can finish it as soon as possible and move forward.

Code: Select all


MODULE CpcMyUtf8;


	(* 
		Accepts 4-byte UTF-8; decodes any well-formed UTF-8 byte sequence to a UTF-16 string,
		which may then be displayed correctly by the WinAPI ...W procedures.
	*)
	
	CONST 
		Truncated = 1;
		IllegalBytes = 2; IllegalChars = 2;
		Surrogated = 4;
	(*
		res = 0: ok
		ODD(res): truncated string
		ODD(res DIV 2): illegal bytes, illegal string
		ODD(res DIV 4): has surrogate, illegal identifier!
	*)
	
	CONST ReplaceChar = 0FFFDX;
	
	(* full decode: decodes one or more illegal bytes to a single ReplaceChar *)
	PROCEDURE Utf8ToString* (IN utf8: ARRAY OF SHORTCHAR; OUT str: ARRAY OF CHAR; OUT res: INTEGER);
		VAR 
			x, i, j, max: INTEGER;
			state: INTEGER; (*  e mm s tt          e:err     m:mode    s:surrogate   t:tail bytes *)
			ch: SHORTCHAR;  
			surrogated, truncated: BOOLEAN;
	BEGIN 
		res := 0; truncated := FALSE; surrogated := FALSE; 
		max := LEN(str) - 1; j := 0;
		ch := utf8[0]; i := 0; state := 0;
		WHILE (ch # 0X) & (j < max) DO 
			IF state = 0 THEN 
				CASE ch OF 
				| 1X(*0X*)..7FX: (* 1 bytes*)
					str[j] := ch; INC(j);
				| 0C2X..0DFX: (* 2 bytes *)
					x := ORD(ch) - 192; state := 1;             (* 0 00 0 01*) (* e=0 m=0 s=0 t=1  *)
				| 0E0X..0EFX: (* 3 bytes*)
					x := ORD(ch) - 224; 
					IF ch = 0E0X THEN state := 10;           (*0 01 0 10   *) (* e=0 m=1 s=0 t=2  *)
					ELSIF ch = 0EDX THEN state := 18;    (*0 10 0 10  *) (* e=0 m=2 s=0 t=2  *)
					ELSE state := 2;                                    (*0 0 0 10  *) (* e=0 m=0 s=0 t=2  *)
					END;
				| 0F0X..0F4X: (*4 bytes *)
					x := ORD(ch) - 240; 
					IF ch = 0F0X THEN state := 15;           (*0 01 1 11 *) (* e=0 m=1 s=1 t=3  *)
					ELSIF ch = 0F4X THEN state := 23;    (*0 10 1 11 *)  (* e=0 m=2 s=1 t=3  *)
					ELSE state := 7;                                   (* 0 00 1 11 *) (* e=0 m=0 s=1 t=3  *)
					END;
				ELSE (* single illegal byte *)
					str[j] := ReplaceChar; INC(j); res := IllegalBytes;
				END;
			ELSE 
				CASE state OF
				| 10: IF (ch < 0A0X)OR(ch > 0BFX) THEN INC(state, 32); END;
				| 18: IF (ch < 080X)OR(ch > 09FX) THEN INC(state, 32); END;
				| 15: IF (ch < 090X)OR(ch > 0BFX) THEN INC(state, 32); END;
				| 23: IF (ch < 080X)OR(ch > 08FX) THEN INC(state, 32); END;
				ELSE 
					IF (ch < 080X)OR(ch > 0BFX) THEN INC(state, 32); END;
				END; 
				IF state < 32 THEN 
					DEC(state); x := x * 64 + ORD(ch) - 128; 
					IF state MOD 4 = 0 THEN 
						IF ODD(state DIV 4) THEN 
							surrogated := TRUE; 
							DEC(x, 10000H);
							str[j] := CHR(0D800H + x DIV 400H); INC(j);
							str[j] := CHR(0DC00H + x MOD 400H); INC(j);  (*maybe j>max  *)
						ELSE
							str[j] := CHR(x); INC(j); 
						END;
						state := 0;
					END;
				ELSE(*multi illegal bytes*)
					str[j] := ReplaceChar; INC(j); DEC(i);  (* not consume illegal tail byte *)
					res := IllegalBytes; state := 0;
				END;
			END;
			INC(i); ch := utf8[i];
		END;
		IF state # 0 THEN str[j] := ReplaceChar; INC(j); res := IllegalBytes; END;        (*unfinished multi legal bytes ==>illegal bytes  *)
		IF (ch # 0X) OR(j > max) THEN truncated := TRUE; j := max; END;
		str[j] := 0X; 
		IF truncated THEN INC(res, Truncated); END; 
		IF surrogated THEN INC(res, Surrogated); END;
	END Utf8ToString;
	
	
	(* short decode: stops decoding at the first illegal byte *)
	PROCEDURE Utf8ToStringShort* (IN utf8: ARRAY OF SHORTCHAR; OUT str: ARRAY OF CHAR; OUT res: INTEGER);
		VAR 
			x, i, j, max: INTEGER;
			state: INTEGER; (*  e mm s tt          e:err     m:mode    s:surrogate   t:tail bytes *)
			ch: SHORTCHAR;  
			surrogated, truncated: BOOLEAN;
	BEGIN 
		res := 0; truncated := FALSE; surrogated := FALSE; 
		max := LEN(str) - 1; j := 0;
		ch := utf8[0]; i := 0; state := 0;
		WHILE (ch # 0X) & (j < max) DO 
			IF state = 0 THEN 
				CASE ch OF 
				| 1X(*0X*)..7FX: (* 1 bytes*)
					str[j] := ch; INC(j);
				| 0C2X..0DFX: (* 2 bytes *)
					x := ORD(ch) - 192; state := 1;             (* 0 00 0 01*) (* e=0 m=0 s=0 t=1  *)
				| 0E0X..0EFX: (* 3 bytes*)
					x := ORD(ch) - 224; 
					IF ch = 0E0X THEN state := 10;           (*0 01 0 10   *) (* e=0 m=1 s=0 t=2  *)
					ELSIF ch = 0EDX THEN state := 18;    (*0 10 0 10  *) (* e=0 m=2 s=0 t=2  *)
					ELSE state := 2;                                    (*0 0 0 10  *) (* e=0 m=0 s=0 t=2  *)
					END;
				| 0F0X..0F4X: (*4 bytes *)
					x := ORD(ch) - 240; 
					IF ch = 0F0X THEN state := 15;           (*0 01 1 11 *) (* e=0 m=1 s=1 t=3  *)
					ELSIF ch = 0F4X THEN state := 23;    (*0 10 1 11 *)  (* e=0 m=2 s=1 t=3  *)
					ELSE state := 7;                                   (* 0 00 1 11 *) (* e=0 m=0 s=1 t=3  *)
					END;
				ELSE (* single illegal byte *)
					(*str[j] := ReplaceChar; INC(j); *)res := IllegalBytes; RETURN;
				END;
			ELSE 
				CASE state OF
				| 10: IF (ch < 0A0X)OR(ch > 0BFX) THEN INC(state, 32); END;
				| 18: IF (ch < 080X)OR(ch > 09FX) THEN INC(state, 32); END;
				| 15: IF (ch < 090X)OR(ch > 0BFX) THEN INC(state, 32); END;
				| 23: IF (ch < 080X)OR(ch > 08FX) THEN INC(state, 32); END;
				ELSE 
					IF (ch < 080X)OR(ch > 0BFX) THEN INC(state, 32); END;
				END; 
				IF state < 32 THEN 
					DEC(state); x := x * 64 + ORD(ch) - 128; 
					IF state MOD 4 = 0 THEN 
						IF ODD(state DIV 4) THEN 
							surrogated := TRUE; 
							DEC(x, 10000H);
							str[j] := CHR(0D800H + x DIV 400H); INC(j);
							str[j] := CHR(0DC00H + x MOD 400H); INC(j);  (*maybe j>max  *)
						ELSE
							str[j] := CHR(x); INC(j); 
						END;
						state := 0;
					END;
				ELSE(*multi illegal bytes*)
					res := IllegalBytes; RETURN;
				END;
			END;
			INC(i); ch := utf8[i];
		END;
		IF state # 0 THEN res := IllegalBytes; RETURN; END;        (*unfinished multi legal bytes ==>illegal bytes  *)
		IF (ch # 0X) OR(j > max) THEN res := Truncated; RETURN; END;
		str[j] := 0X; 
		IF surrogated THEN res := Surrogated; END;
	END Utf8ToStringShort;
	
		
	(* encodes a UTF-16 string; illegal bytes decoded by Utf8ToString cannot be recovered *)
	PROCEDURE StringToUtf8* (IN str: ARRAY OF CHAR; OUT utf8: ARRAY OF SHORTCHAR; OUT res: INTEGER);
		VAR i, j, val, max: INTEGER; (*surr:BOOLEAN;*)
	BEGIN res := 0;(* surr:=FALSE;*)
		i := 0; j := 0; max := LEN(utf8) - 4;
		WHILE (str[i] # 0X) & (j < max) DO
			val := ORD(str[i]); INC(i);
			IF (val >= 0D800H) & (val < 0E000H) THEN (* check surrogate *)
				IF (val >= 0D800H) & (val < 0DC00H) & (str[i] >= 0DC00X) & (str[i] < 0E000X) THEN  (* valid pair; surr:=TRUE; *)
				ELSE val := ORD(ReplaceChar); res := IllegalChars; (* lone surrogate: illegal char *)
				END;
			END;
			IF val < 128 THEN
				utf8[j] := SHORT(CHR(val)); INC(j)
			ELSIF val < 2048 THEN
				utf8[j] := SHORT(CHR(val DIV 64 + 192)); INC(j);
				utf8[j] := SHORT(CHR(val MOD 64 + 128)); INC(j)
			ELSIF (val >= 0D800H) & (val < 0DC00H) THEN (* high surrogate of a valid pair *)
				val := (val - 0D800H) * 400H + (ORD(str[i]) - 0DC00H) + 10000H;
				utf8[j] := SHORT(CHR(0F0H + val DIV 40000H)); INC(j);
				utf8[j] := SHORT(CHR(80H + val DIV 4096 MOD 64)); INC(j);
				utf8[j] := SHORT(CHR(80H + val DIV 64 MOD 64)); INC(j);
				utf8[j] := SHORT(CHR(80H + val MOD 64)); INC(j);
				INC(i);
			ELSE(**)
				utf8[j] := SHORT(CHR(val DIV 4096 + 224)); INC(j); 
				utf8[j] := SHORT(CHR(val DIV 64 MOD 64 + 128)); INC(j);
				utf8[j] := SHORT(CHR(val MOD 64 + 128)); INC(j)
			END;
		END;
		utf8[j] := 0X;
		IF str[i] # 0X THEN INC(res, Truncated) END;
		(*IF surr THEN INC(res, Surrogated); END;*)
	END StringToUtf8;
	
	
	
	(* full decode:  PEP383 version, each illegal byte decode to DCXX *)
	PROCEDURE Utf8ToString2* (IN utf8: ARRAY OF SHORTCHAR; OUT str: ARRAY OF CHAR; OUT res: INTEGER);
		VAR 
			x, i, j, max: INTEGER;
			state: INTEGER; (*  e mm s tt          e:err     m:mode    s:surrogate   t:tail bytes *)
			ch: SHORTCHAR;  
			surrogated, truncated: BOOLEAN;
	
		VAR (*PEP383*)
			buf: ARRAY 3 OF SHORTCHAR; (* buffered valid bytes*)
			n: INTEGER;

	BEGIN 
		res := 0; truncated := FALSE; surrogated := FALSE; 
		max := LEN(str) - 1; j := 0;
		ch := utf8[0]; i := 0; state := 0;
		WHILE (ch # 0X) & (j < max) DO 
			IF state = 0 THEN 
				buf[0] := ch; n := 1;
				CASE ch OF 
				| 1X(*0X*)..7FX: (* 1 bytes*)
					str[j] := ch; INC(j);
				| 0C2X..0DFX: (* 2 bytes *)
					x := ORD(ch) - 192; state := 1;             (* 0 00 0 01*) (* e=0 m=0 s=0 t=1  *)
				| 0E0X..0EFX: (* 3 bytes*)
					x := ORD(ch) - 224; 
					IF ch = 0E0X THEN state := 10;           (*0 01 0 10   *) (* e=0 m=1 s=0 t=2  *)
					ELSIF ch = 0EDX THEN state := 18;    (*0 10 0 10  *) (* e=0 m=2 s=0 t=2  *)
					ELSE state := 2;                                    (*0 0 0 10  *) (* e=0 m=0 s=0 t=2  *)
					END;
				| 0F0X..0F4X: (*4 bytes *)
					x := ORD(ch) - 240; 
					IF ch = 0F0X THEN state := 15;           (*0 01 1 11 *) (* e=0 m=1 s=1 t=3  *)
					ELSIF ch = 0F4X THEN state := 23;    (*0 10 1 11 *)  (* e=0 m=2 s=1 t=3  *)
					ELSE state := 7;                                   (* 0 00 1 11 *) (* e=0 m=0 s=1 t=3  *)
					END;
				ELSE (* single illegal byte *)
					str[j] := CHR(0DC00H + ORD(ch)); INC(j); res := IllegalBytes;
				END;
			ELSE 
				CASE state OF
				| 10: IF (ch < 0A0X)OR(ch > 0BFX) THEN INC(state, 32); END;
				| 18: IF (ch < 080X)OR(ch > 09FX) THEN INC(state, 32); END;
				| 15: IF (ch < 090X)OR(ch > 0BFX) THEN INC(state, 32); END;
				| 23: IF (ch < 080X)OR(ch > 08FX) THEN INC(state, 32); END;
				ELSE 
					IF (ch < 080X)OR(ch > 0BFX) THEN INC(state, 32); END;
				END; 
				IF state < 32 THEN 
					DEC(state); x := x * 64 + ORD(ch) - 128; 
					IF state MOD 4 = 0 THEN 
						IF ODD(state DIV 4) THEN surrogated := TRUE; 
							DEC(x, 10000H);
							str[j] := CHR(0D800H + x DIV 400H); INC(j);
							str[j] := CHR(0DC00H + x MOD 400H); INC(j);  (*maybe j>max  *)
						ELSE
							str[j] := CHR(x); INC(j); 
						END;
						state := 0;
					ELSE buf[n] := ch; INC(n);
					END;
				ELSE(*multi illegal bytes*)
					res := IllegalBytes; 
					FOR x := 0 TO n - 1 DO 
						IF j <= max THEN str[j] := CHR(0DC00H + ORD(buf[x])); INC(j); END; (* DCXX  *)
					END;
					DEC(i);  (* not consume illegal tail byte *)
					state := 0;
				END;
			END;
			INC(i); ch := utf8[i];
		END;
		
		IF state # 0 THEN 
			res := IllegalBytes;         (*unfinished multi legal bytes ==>illegal bytes  *)
			FOR x := 0 TO n - 1 DO 
				IF j <= max THEN str[j] := CHR(0DC00H + ORD(buf[x])); INC(j); END; (* DCXX  *)
			END; 
		END;
		IF(ch # 0X)OR(j > max) THEN truncated := TRUE; j := max; END;
		str[j] := 0X; 
		
		IF truncated THEN INC(res, Truncated); END; 
		IF surrogated THEN INC(res, Surrogated); END;
	END Utf8ToString2;
	
	(* short decode:  PEP383 version, stop when find a illegal byte  *)
	PROCEDURE Utf8ToString2Short* (IN utf8: ARRAY OF SHORTCHAR; OUT str: ARRAY OF CHAR; OUT res: INTEGER);
		VAR 
			x, i, j, max: INTEGER;
			state: INTEGER; (*  e mm s tt          e:err     m:mode    s:surrogate   t:tail bytes *)
			ch: SHORTCHAR;  
			surrogated, truncated: BOOLEAN;
	
		VAR (*PEP383*)
			buf: ARRAY 3 OF SHORTCHAR; (* buffered valid bytes*)
			n: INTEGER;

	BEGIN 
		res := 0; truncated := FALSE; surrogated := FALSE; 
		max := LEN(str) - 1; j := 0;
		ch := utf8[0]; i := 0; state := 0;
		WHILE (ch # 0X) & (j < max) DO 
			IF state = 0 THEN 
				buf[0] := ch; n := 1;
				CASE ch OF 
				| 1X(*0X*)..7FX: (* 1 bytes*)
					str[j] := ch; INC(j);
				| 0C2X..0DFX: (* 2 bytes *)
					x := ORD(ch) - 192; state := 1;             (* 0 00 0 01*) (* e=0 m=0 s=0 t=1  *)
				| 0E0X..0EFX: (* 3 bytes*)
					x := ORD(ch) - 224; 
					IF ch = 0E0X THEN state := 10;           (*0 01 0 10   *) (* e=0 m=1 s=0 t=2  *)
					ELSIF ch = 0EDX THEN state := 18;    (*0 10 0 10  *) (* e=0 m=2 s=0 t=2  *)
					ELSE state := 2;                                    (*0 0 0 10  *) (* e=0 m=0 s=0 t=2  *)
					END;
				| 0F0X..0F4X: (*4 bytes *)
					x := ORD(ch) - 240; 
					IF ch = 0F0X THEN state := 15;           (*0 01 1 11 *) (* e=0 m=1 s=1 t=3  *)
					ELSIF ch = 0F4X THEN state := 23;    (*0 10 1 11 *)  (* e=0 m=2 s=1 t=3  *)
					ELSE state := 7;                                   (* 0 00 1 11 *) (* e=0 m=0 s=1 t=3  *)
					END;
				ELSE (* single illegal byte *)
					res := IllegalBytes; RETURN;
				END;
			ELSE 
				CASE state OF
				| 10: IF (ch < 0A0X)OR(ch > 0BFX) THEN INC(state, 32); END;
				| 18: IF (ch < 080X)OR(ch > 09FX) THEN INC(state, 32); END;
				| 15: IF (ch < 090X)OR(ch > 0BFX) THEN INC(state, 32); END;
				| 23: IF (ch < 080X)OR(ch > 08FX) THEN INC(state, 32); END;
				ELSE 
					IF (ch < 080X)OR(ch > 0BFX) THEN INC(state, 32); END;
				END; 
				IF state < 32 THEN 
					DEC(state); x := x * 64 + ORD(ch) - 128; 
					IF state MOD 4 = 0 THEN 
						IF ODD(state DIV 4) THEN surrogated := TRUE; 
							DEC(x, 10000H);
							str[j] := CHR(0D800H + x DIV 400H); INC(j);
							str[j] := CHR(0DC00H + x MOD 400H); INC(j);  (*maybe j>max  *)
						ELSE
							str[j] := CHR(x); INC(j); 
						END;
						state := 0;
					ELSE buf[n] := ch; INC(n);
					END;
				ELSE(*multi illegal bytes*)
					res := IllegalBytes; RETURN;
				END;
			END;
			INC(i); ch := utf8[i];
		END;
		
		IF state # 0 THEN res := IllegalBytes; RETURN; END;        
		IF(ch # 0X)OR(j > max) THEN res := Truncated; RETURN; END;
		str[j] := 0X; 
		IF surrogated THEN res := Surrogated; END;
	END Utf8ToString2Short;
	
	(* PEP 383 style encode: DCXX is encoded as byte XX, so illegal bytes decoded by Utf8ToString2 can be recovered *)
	PROCEDURE StringToUtf82* (IN str: ARRAY OF CHAR; OUT utf8: ARRAY OF SHORTCHAR; OUT res: INTEGER);
		CONST IllegalChars = 80000000H; Truncated = 1;
		VAR i, j, val, max: INTEGER;
	BEGIN res := 0;
		i := 0; j := 0; max := LEN(utf8) - 4;
		WHILE (str[i] # 0X) & (j < max) DO
			val := ORD(str[i]); INC(i);
			IF val < 128 THEN
				utf8[j] := SHORT(CHR(val)); INC(j)
			ELSIF val < 2048 THEN
				utf8[j] := SHORT(CHR(val DIV 64 + 192)); INC(j);
				utf8[j] := SHORT(CHR(val MOD 64 + 128)); INC(j)
			ELSIF (0DC80H <= val) & (val < 0DD00H) THEN (* PEP 383 escaped illegal byte 80..FF *)
				res := IllegalChars;
				utf8[j] := SHORT(CHR(val MOD 100H)); INC(j);
			ELSIF (0D800H <= val) & (val < 0DC00H) & (0DC00X <= str[i]) & (str[i] < 0E000X) THEN  (* surrogate pair *)
				val := (val - 0D800H) * 400H + (ORD(str[i]) - 0DC00H) + 10000H;
				utf8[j] := SHORT(CHR(0F0H + val DIV 40000H)); INC(j);
				utf8[j] := SHORT(CHR(80H + val DIV 4096 MOD 64)); INC(j);
				utf8[j] := SHORT(CHR(80H + val DIV 64 MOD 64)); INC(j);
				utf8[j] := SHORT(CHR(80H + val MOD 64)); INC(j);
				INC(i); (* two chars *)
			ELSE(**)
				utf8[j] := SHORT(CHR(val DIV 4096 + 224)); INC(j); 
				utf8[j] := SHORT(CHR(val DIV 64 MOD 64 + 128)); INC(j);
				utf8[j] := SHORT(CHR(val MOD 64 + 128)); INC(j)
			END;
		END;
		utf8[j] := 0X;
		IF str[i] # 0X THEN INC(res, 1) (*Truncated*) END
	END StringToUtf82;
	
	
	
	

END CpcMyUtf8.

luowy
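
The PEP 383 idea behind Utf8ToString2 and StringToUtf82 (each illegal byte XX becomes the lone surrogate DCXX, and DCXX is written back as byte XX) is built into Python as the `surrogateescape` error handler. The following Python sketch only illustrates that round trip with Python's own codec; it does not exercise the module above, and the sample bytes are made up for the example.

```python
# 0xFF can never occur in UTF-8; E4 B8 is a 3-byte sequence cut off after two bytes.
raw = b"ab\xffcd\xe4\xb8"

# PEP 383: every undecodable byte XX is mapped to the lone surrogate U+DCXX.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns U+DCXX back into byte XX.
back = text.encode("utf-8", errors="surrogateescape")

# For comparison: the replacement-character strategy loses the original bytes.
replaced = raw.decode("utf-8", errors="replace")

print([hex(ord(c)) for c in text])
print(back == raw, "\ufffd" in replaced)
```

The trade-off is the one the two decoder families above make: the ReplaceChar variant produces a legal string but cannot be reversed, while the DCXX variant can be reversed but leaves lone surrogates in the string.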

Post by Ivan Denisov »

Please, Doug, set up a vote to choose between luowy's and Josef's versions.

The draft options are:

- we should adopt Josef's 15% faster version of the UTF-8 converter, with a simple format check
- we should adopt WenYing's version of the UTF-8 converter, with a well-formedness check according to the Unicode 7.0 standard
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Post by Josef Templ »

> Now in UTF-16 it is used for surrogates for coding values greater than U+FFFF.

UTF-16 is not relevant for us. We are using UTF-8.
The Wikipedia article on UTF-8 says about invalid code points:
"Whether an actual application should do this is debatable, as ...".
For our purposes it is completely clear that it makes no sense; it only adds complexity and runtime without any benefit.
If we find out that invalid Unicode code points are indeed a problem, this must be treated as a separate issue, and
it must be fixed in many more places than just the Utf8 conversion.

> we spend too much time on this issue.I hope we can finish it as soon as possible and go forward.

I agree. Doug, could you please set up the vote?

- Josef

Post by Josef Templ »

To prepare for the upcoming vote, I have put one big update and some minor updates into the #19 branch.
The big update is the Utf8ToString conversion, which is now aligned with Helmut's CPC 1.7 rc4.
It does a format check but not a contents check. We have to vote on this, of course, but
now you can at least see what this version looks like and compare it more easily with
alternative proposals.

The minor updates are described in the commit message.
See https://github.com/BlackBoxCenter/black ... b6916b73fb
or http://redmine.blackboxframework.org/pr ... b6916b73fb.

- Josef

Post by luowy »

Josef,

According to RFC 3629, I have condensed the grammar into a table:

Code: Select all

---------------------------------------------------------------
   UTF8-octets = (UTF8-char)*
   UTF8-char   = UTF8-1 | UTF8-2 | UTF8-3 | UTF8-4

   UTF8-1      = 00-7F
   UTF8-2      = C2-DF UTF8Tail

   UTF8-3      = E0 A0-BF UTF8Tail |
                 E1-EC (UTF8Tail)*2 |
                 ED 80-9F UTF8Tail |
                 EE-EF (UTF8Tail)*2

   UTF8-4      = F0 90-BF (UTF8Tail)*2 |
                 F1-F3 (UTF8Tail)*3 |
                 F4 80-8F (UTF8Tail)*2

   UTF8Tail    = 80-BF
---------------------------------------------------------------
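
The table above can be transcribed almost line for line into a byte-level regular expression. This is a Python sketch for checking inputs against the table, not a proposal for the kernel code; the names `WELL_FORMED` and `is_well_formed` are made up for the example.

```python
import re

# One alternative per row of the table; [\x80-\xBF] is UTF8Tail.
WELL_FORMED = re.compile(
    rb"(?:"
    rb"[\x00-\x7F]"                        # UTF8-1
    rb"|[\xC2-\xDF][\x80-\xBF]"            # UTF8-2
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"        # UTF8-3: no overlong forms
    rb"|[\xE1-\xEC][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"        # UTF8-3: no surrogates
    rb"|[\xEE-\xEF][\x80-\xBF]{2}"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"     # UTF8-4: no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"     # UTF8-4: nothing above U+10FFFF
    rb")*"
)


def is_well_formed(data: bytes) -> bool:
    """True if the whole byte string matches UTF8-octets from the table."""
    return WELL_FORMED.fullmatch(data) is not None


print(is_well_formed("a\u00e9\u4e16\U000233b4".encode("utf-8")))  # 1 to 4 byte chars
print(is_well_formed(b"\xe0\x81\xa1"))   # overlong 'a'
print(is_well_formed(b"\xed\xa0\x80"))   # encoded surrogate U+D800
```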

I suggest a candidate procedure based on your version; it is efficient and follows the UTF-8 standard.

Code: Select all

	PROCEDURE Utf8ToString* (IN in: ARRAY OF SHORTCHAR; OUT out: ARRAY OF CHAR; OUT res: INTEGER);
		VAR i, j, val, max: INTEGER; ch, ch0: SHORTCHAR;
	BEGIN
		ch := in[0]; i := 1; j := 0; max := LEN(out) - 1;
		WHILE (ch # 0X) & (j < max) DO
			IF ch < 80X THEN           (*1 byte   00-7F *)
				out[j] := ch; INC(j)
			ELSIF ch < 0E0X THEN  (* 2 bytes  C2-DF UTF8Tail *)
				val := ORD(ch) - 192; IF val < 2 (*0*) THEN out := ""; res := 2; RETURN END; 
				ch := in[i]; INC(i); IF (ch < 80X) OR (ch >= 0E0X) THEN out := ""; res := 2; RETURN END;
				val := val * 64 + ORD(ch) - 128;
				out[j] := CHR(val); INC(j)
			ELSIF ch < 0F0X THEN  (* 3 bytes  *)
				val := ORD(ch) - 224; ch0 := ch; ch := in[i]; INC(i); 
				IF (ch0 = 0E0X)&(ch >= 0A0X)&(ch <= 0BFX) OR (ch0 = 0EDX)& (ch >= 80X)&(ch <= 9FX) 
					OR (ch >= 80X)&(ch <= 0BFX) THEN val := val * 64 + ORD(ch) - 128;
				ELSE out := ""; res := 2; RETURN 
				END;
				ch := in[i]; INC(i); IF (ch < 80X) OR (ch >= 0E0X) THEN out := ""; res := 2; RETURN END;
				val := val * 64 + ORD(ch) - 128;
				out[j] := CHR(val); INC(j)
			ELSE(* 4 bytes *)
				out := ""; res := 2; RETURN
			END;
			ch := in[i]; INC(i)
		END;
		out[j] := 0X;
		IF ch = 0X THEN res := 0 (*ok*) ELSE res := 1 (*truncated*) END
	END Utf8ToString;
	
luowy

Post by Ivan Denisov »

Luowy, your new version based on Josef's also fails to detect this ill-formed string: 0EDX 0A1X 8CX 0EDX 0BEX 0B4X 0X
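
To see why this test string is ill-formed, here is a Python sketch (illustrative only; `decode3` is a hypothetical helper). Each 3-byte group decodes by bit layout to a surrogate, the two surrogates form a UTF-16 pair, and the code point they stand for has a different, 4-byte, well-formed UTF-8 encoding.

```python
bad = bytes([0xED, 0xA1, 0x8C, 0xED, 0xBE, 0xB4])


def decode3(b: bytes) -> int:
    """Decode a 3-byte sequence by bit layout only, with no range checks."""
    return (b[0] & 0x0F) << 12 | (b[1] & 0x3F) << 6 | (b[2] & 0x3F)


hi = decode3(bad[:3])    # high surrogate
lo = decode3(bad[3:])    # low surrogate

# Standard UTF-16 surrogate pair arithmetic.
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# The well-formed UTF-8 for that code point is a single 4-byte sequence.
well_formed = chr(cp).encode("utf-8")

try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(hex(hi), hex(lo), hex(cp), well_formed.hex(" "), strict_ok)
```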

Code: Select all

Josef Templ version:
68.7 ms
12.4 ms
Incorrect input 1:  $TRUE
Incorrect input 2:  $FALSE
Truncated:  $TRUE

Alexandr Shiryaev version:
80.3 ms
15.8 ms
Incorrect input 1:  $TRUE
Incorrect input 2:  $TRUE
Truncated:  $TRUE

LuoWy full decode:  PEP383 version
83.3 ms
14.7 ms
Incorrect input 1:  $TRUE
Incorrect input 2:  $TRUE
Truncated:  $TRUE

LuoWy version based on Josef's:
65.7 ms
12.3 ms
Incorrect input 1:  $TRUE
Incorrect input 2:  $FALSE
Truncated:  $TRUE

I have attached the test for all versions to this post.

I have applied your full-decode PEP 383 version to branch #19.
Now we can see the differences between the full-check and simple-check implementations.

Both versions were built by our pipeline and work well:
- simple format check: blackbox-1.7-a1.026.zip
- full format check: blackbox-1.7-a1.027.zip
Attachment: Utf8TestNew.txt (19.63 KiB)
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Post by DGDanforth »

Ivan Denisov wrote:Please, Doug, set up a vote to choose between luowy's and Josef's versions.

The draft options are:

- we should adopt Josef's 15% faster version of the UTF-8 converter, with a simple format check
- we should adopt WenYing's version of the UTF-8 converter, with a well-formedness check according to the Unicode 7.0 standard
Are we agreed that the choice is between luowy's (which version?) and Josef's?