Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 2:33 am
by luowy
Ivan Denisov wrote: Luowy, your new version based on Josef's also fails to detect the bad-format string: 0EDX 0A1X 8CX 0EDX 0BEX 0B4X 0X
Sorry, it has a bug. Here is the fix:

Code:

	PROCEDURE Utf8ToString* (IN in: ARRAY OF SHORTCHAR; OUT out: ARRAY OF CHAR; OUT res: INTEGER);
		(* res = 0: ok; res = 1: out too short (truncated); res = 2: ill-formed or unsupported input *)
		VAR i, j, val, max: INTEGER; ch, ch0: SHORTCHAR;
	BEGIN
		ch := in[0]; i := 1; j := 0; max := LEN(out) - 1;
		WHILE (ch # 0X) & (j < max) DO
			IF ch < 80X THEN	(* 1 byte: 00-7F *)
				out[j] := ch; INC(j)
			ELSIF ch < 0E0X THEN	(* 2 bytes: C2-DF UTF8Tail *)
				(* val < 2 rejects stray tail bytes 80-BF and the overlong lead bytes C0, C1 *)
				val := ORD(ch) - 192; IF val < 2 THEN out := ""; res := 2; RETURN END;
				(* a tail byte must lie in 80-BF *)
				ch := in[i]; INC(i); IF (ch < 80X) OR (ch > 0BFX) THEN out := ""; res := 2; RETURN END;
				val := val * 64 + ORD(ch) - 128;
				out[j] := CHR(val); INC(j)
			ELSIF ch < 0F0X THEN	(* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
				val := ORD(ch) - 224; ch0 := ch; ch := in[i]; INC(i);
				(* second byte: E0 needs A0-BF (no overlong forms), ED needs 80-9F (no surrogates), others 80-BF *)
				IF (ch0 = 0E0X) & (ch >= 0A0X) & (ch <= 0BFX) OR (ch0 = 0EDX) & (ch >= 80X) & (ch <= 9FX)
					OR (ch0 # 0E0X) & (ch0 # 0EDX) & (ch >= 80X) & (ch <= 0BFX) THEN val := val * 64 + ORD(ch) - 128
				ELSE out := ""; res := 2; RETURN
				END;
				ch := in[i]; INC(i); IF (ch < 80X) OR (ch > 0BFX) THEN out := ""; res := 2; RETURN END;
				val := val * 64 + ORD(ch) - 128;
				out[j] := CHR(val); INC(j)
			ELSE	(* 4 bytes: not supported *)
				out := ""; res := 2; RETURN
			END;
			ch := in[i]; INC(i)
		END;
		out[j] := 0X;
		IF ch = 0X THEN res := 0 (* ok *) ELSE res := 1 (* truncated *) END
	END Utf8ToString;
	

DGDanforth wrote: Are we agreed that the choice is between luowy's (which version?) and Josef's?
I prefer this version. For now we do not need characters that take 4-byte sequences in identifiers, so this version is good enough for the kernel.
The full version could be improved in Python style (several modes in one function) for use in a string library, but that is another subject.
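
For reference, the byte sequence from Ivan's report is a surrogate pair encoded as two 3-byte sequences: 0EDX 0A1X 8CX would decode to 0D84CX and 0EDX 0BEX 0B4X to 0DFB4X, which well-formed UTF-8 does not allow. A minimal check might look like the sketch below. It assumes the procedure above is exported as Kernel.Utf8ToString; the module name TestUtf8Bad and the command Do are made up for illustration.

```component-pascal
MODULE TestUtf8Bad;
	(* hypothetical test module; assumes the converter is exported as Kernel.Utf8ToString *)
	IMPORT Kernel, StdLog;

	PROCEDURE Do*;
		VAR in: ARRAY 8 OF SHORTCHAR; out: ARRAY 8 OF CHAR; res: INTEGER;
	BEGIN
		(* the bad-format string from the report: a UTF-8 encoded surrogate pair *)
		in[0] := 0EDX; in[1] := 0A1X; in[2] := 8CX;
		in[3] := 0EDX; in[4] := 0BEX; in[5] := 0B4X;
		in[6] := 0X;
		Kernel.Utf8ToString(in, out, res);
		(* print the result code: 0 = ok, 1 = truncated, 2 = bad format *)
		StdLog.String("res = "); StdLog.Int(res); StdLog.Ln
	END Do;

END TestUtf8Bad.
```

With the 0EDX branch above, the second byte 0A1X falls outside 80X..9FX, so the call should return res = 2.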


luowy

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 4:56 am
by Ivan Denisov
Now it passes the test and seems a bit faster. I improved the test; it now checks all bad-format UTF8-3 (3-byte) characters.

Code:

Josef Templ version:
69.2 ms
12.6 ms
Incorrect input:  $FALSE
Truncated:  $TRUE

LuoWy version based on Josef's:
66.6 ms
12.4 ms
Incorrect input:  $TRUE
Truncated:  $TRUE

I do not want to add the code to the branch until the vote. It is already a bit messy.
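
The test itself is not shown in the thread. A sketch of an exhaustive check of 3-byte sequences, under the assumption that the converter is reachable as Kernel.Utf8ToString, could look like this. TestUtf8All, WellFormed3 and Do are hypothetical names, and the expected ranges follow the well-formed byte sequence table of the Unicode standard (E0 A0..BF, ED 80..9F, otherwise 80..BF, then a tail byte 80..BF).

```component-pascal
MODULE TestUtf8All;
	(* hypothetical test module; assumes the converter is exported as Kernel.Utf8ToString *)
	IMPORT Kernel, StdLog;

	(* TRUE if b0 b1 b2 (with b0 in E0..EF) is a well-formed 3-byte UTF-8 sequence *)
	PROCEDURE WellFormed3 (b0, b1, b2: INTEGER): BOOLEAN;
		VAR lo, hi: INTEGER;
	BEGIN
		lo := 80H; hi := 0BFH;
		IF b0 = 0E0H THEN lo := 0A0H	(* excludes overlong forms *)
		ELSIF b0 = 0EDH THEN hi := 9FH	(* excludes surrogates D800..DFFF *)
		END;
		RETURN (b1 >= lo) & (b1 <= hi) & (b2 >= 80H) & (b2 <= 0BFH)
	END WellFormed3;

	PROCEDURE Do*;
		VAR in: ARRAY 4 OF SHORTCHAR; out: ARRAY 4 OF CHAR;
			b0, b1, b2, res, bad: INTEGER;
	BEGIN
		bad := 0; in[3] := 0X;
		FOR b0 := 0E0H TO 0EFH DO
			FOR b1 := 1 TO 0FFH DO
				FOR b2 := 1 TO 0FFH DO
					in[0] := SHORT(CHR(b0)); in[1] := SHORT(CHR(b1)); in[2] := SHORT(CHR(b2));
					Kernel.Utf8ToString(in, out, res);
					(* the converter should accept exactly the well-formed sequences *)
					IF (res = 0) # WellFormed3(b0, b1, b2) THEN INC(bad) END
				END
			END
		END;
		StdLog.String("mismatches: "); StdLog.Int(bad); StdLog.Ln
	END Do;

END TestUtf8All.
```

Zero bytes are skipped in the second and third position because 0X terminates the input string.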

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Wed Nov 26, 2014 5:35 am
by Ivan Denisov
DGDanforth wrote:
Ivan Denisov wrote: Doug, please set up a vote to choose between luowy's and Josef T.'s versions.

The draft of the options is:

- we should adopt Josef's 15% faster version of the Utf8 converter with a simple format check
- we should adopt WenYing's version of the Utf8 converter with a well-formedness check according to the Unicode 7.0 standard
Are we agreed that the choice is between luowy's (which version?) and Josef's?
Luowy's latest version seems OK and passes the tests. So we can choose between Josef's version and Luowy's version based on Josef's.

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Fri Dec 05, 2014 5:17 am
by Ivan Denisov
Now, do we need a vote on adopting this feature in general?
The previous vote was for choosing the encoder realization and not for adopting Unicode for Component Pascal identifiers. Am I wrong?

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Fri Dec 05, 2014 8:12 am
by Bernhard
I have edited/deleted the contents of my post; it was a misunderstanding on my side ...

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Fri Dec 05, 2014 8:53 am
by Josef Templ
I have committed the latest changes to the issue-#19 branch.
- Utf8-conversion without checking for illegal Unicode characters
- luowy's optimization for error detection in 3-byte sequences
- Helmut's change of the Kernel.LoaderHook.ThisMod parameter to ARRAY OF CHAR, because this eliminates several Utf8-conversions
The branch is in good shape by now and, from my point of view, ready for merging.
Ivan Denisov wrote: Now, do we need a vote on adopting this feature in general?
The previous vote was for choosing the encoder realization and not for adopting Unicode for Component Pascal identifiers. Am I wrong?
Right, we need an extra vote. Hopefully a fast one.

- Josef

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Sat Dec 06, 2014 6:48 am
by DGDanforth
Ah, that was not my understanding. The title of the issue is "Unicode for Component Pascal identifiers". The subtle difference between that and "encoder realization" escapes me.
-Doug

Josef Templ wrote: I have committed the latest changes to the issue-#19 branch.
- Utf8-conversion without checking for illegal Unicode characters
- luowy's optimization for error detection in 3-byte sequences
- Helmut's change of the Kernel.LoaderHook.ThisMod parameter to ARRAY OF CHAR, because this eliminates several Utf8-conversions
The branch is in good shape by now and, from my point of view, ready for merging.
Ivan Denisov wrote: Now, do we need a vote on adopting this feature in general?
The previous vote was for choosing the encoder realization and not for adopting Unicode for Component Pascal identifiers. Am I wrong?
Right, we need an extra vote. Hopefully a fast one.

- Josef

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Sun Dec 07, 2014 11:59 am
by Ivan Denisov
Doug, please set up the vote!

The version to try is here:
blackbox-1.7-a1.028.zip
blackbox-1.7-a1.028-setup.exe

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Sun Dec 07, 2014 11:22 pm
by DGDanforth
Ivan,
I can't make a vote since I don't know what we are voting for!
To me we are finished. We already voted for and passed the issue.
-Doug
Ivan Denisov wrote: Doug, please set up the vote!

The version to try is here:
blackbox-1.7-a1.028.zip
blackbox-1.7-a1.028-setup.exe

Re: Issue #19: Unicode for Component Pascal identifiers

Posted: Mon Dec 08, 2014 3:20 am
by Ivan Denisov
DGDanforth wrote: I am creating this poll simply trusting and hoping that you all know what you are doing.
Ivan Denisov wrote: We are just choosing between realizations of the Utf8ToString converter in Kernel.
We voted only on the converter. Doug, we need a new vote!

- adopt Unicode for Component Pascal identifiers
- do not adopt Unicode for Component Pascal identifiers
- abstain