issue-#19: Unicode for Component Pascal identifiers

Merged to the master branch
luowy
Posts: 234
Joined: Mon Oct 20, 2014 12:52 pm

Re: Issue #19: Unicode for Component Pascal identifiers

Post by luowy »

Ivan Denisov wrote: Luowy, your new version based on Josef's also fails to detect the bad-format string: 0EDX 0A1X 8CX 0EDX 0BEX 0B4X 0X
Sorry, it has a bug. Here is the fix:

Code: Select all

	PROCEDURE Utf8ToString* (IN in: ARRAY OF SHORTCHAR; OUT out: ARRAY OF CHAR; OUT res: INTEGER);
		(* res: 0 = ok, 1 = truncated, 2 = format error (out is set to "") *)
		VAR i, j, val, max: INTEGER; ch, ch0: SHORTCHAR;
	BEGIN
		ch := in[0]; i := 1; j := 0; max := LEN(out) - 1;
		WHILE (ch # 0X) & (j < max) DO
			IF ch < 80X THEN	(* 1 byte: 00-7F *)
				out[j] := ch; INC(j)
			ELSIF ch < 0E0X THEN	(* 2 bytes: C2-DF UTF8Tail *)
				val := ORD(ch) - 192;
				IF val < 2 THEN out := ""; res := 2; RETURN END;	(* 80-C1: stray tail byte or overlong form *)
				ch := in[i]; INC(i);
				IF (ch < 80X) OR (ch > 0BFX) THEN out := ""; res := 2; RETURN END;	(* UTF8Tail is 80-BF *)
				val := val * 64 + ORD(ch) - 128;
				out[j] := CHR(val); INC(j)
			ELSIF ch < 0F0X THEN	(* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
				val := ORD(ch) - 224; ch0 := ch; ch := in[i]; INC(i);
				(* second byte: A0-BF after E0 (no overlong forms), 80-9F after ED (no surrogates), else 80-BF *)
				IF (ch0 = 0E0X) & (ch >= 0A0X) & (ch <= 0BFX)
					OR (ch0 = 0EDX) & (ch >= 80X) & (ch <= 9FX)
					OR (ch0 # 0E0X) & (ch0 # 0EDX) & (ch >= 80X) & (ch <= 0BFX)
				THEN val := val * 64 + ORD(ch) - 128
				ELSE out := ""; res := 2; RETURN
				END;
				ch := in[i]; INC(i);
				IF (ch < 80X) OR (ch > 0BFX) THEN out := ""; res := 2; RETURN END;	(* UTF8Tail is 80-BF *)
				val := val * 64 + ORD(ch) - 128;
				out[j] := CHR(val); INC(j)
			ELSE	(* 4 bytes or more: rejected *)
				out := ""; res := 2; RETURN
			END;
			ch := in[i]; INC(i)
		END;
		out[j] := 0X;
		IF ch = 0X THEN res := 0 (*ok*) ELSE res := 1 (*truncated*) END
	END Utf8ToString;
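
As a quick check, the bad-format string from Ivan's report can be fed to the procedure. This is only a sketch: the module name TestUtf8Bad is made up, and it assumes the procedure listed above is exported from Kernel and that StdLog is available for output. The bytes ED A1 8C ED BE B4 are a UTF-16 surrogate pair written as two 3-byte sequences. Going by the listing, the second byte 0A1X after 0EDX lies outside 80X-9FX, so res should come back as 2.

```component-pascal
MODULE TestUtf8Bad;	(* hypothetical module name, for illustration only *)
	IMPORT Kernel, StdLog;

	PROCEDURE Do*;
		VAR in: ARRAY 8 OF SHORTCHAR; out: ARRAY 8 OF CHAR; res: INTEGER;
	BEGIN
		(* the bad-format string from the report: 0EDX 0A1X 8CX 0EDX 0BEX 0B4X 0X *)
		in[0] := 0EDX; in[1] := 0A1X; in[2] := 8CX;
		in[3] := 0EDX; in[4] := 0BEX; in[5] := 0B4X; in[6] := 0X;
		(* assumption: Utf8ToString is exported from Kernel with the signature shown above *)
		Kernel.Utf8ToString(in, out, res);
		StdLog.String("res = "); StdLog.Int(res); StdLog.Ln
	END Do;

END TestUtf8Bad.
```

After compiling, the check can be run with a commander for TestUtf8Bad.Do; the result appears in the log window.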
	

DGDanforth wrote: Are we agreed that the choice is between luowys (which version?) and Josef's?
I prefer this version. For now we do not need characters that take 4-byte sequences in identifiers, so this version is good enough for the kernel.
The full version could be extended Python-style (several modes in one function) for use in a string library, but that is another subject.


luowy
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Now it passes the test and seems a bit faster. I improved the test: it now checks all UTF8-3 bad-format chars.

Code: Select all

Josef Templ version:
69.2 ms
12.6 ms
Incorrect input:  $FALSE
Truncated:  $TRUE

LuoWy version based on Josef's:
66.6 ms
12.4 ms
Incorrect input:  $TRUE
Truncated:  $TRUE
I do not want to add the code to the branch until the vote. The branch is already a bit messy.
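
For readers who want to reproduce a check of this kind, here is a minimal sketch. It is not the attached test. It walks every lead byte E0-EF with every possible second byte and compares the converter's verdict with the well-formedness rule for 3-byte sequences. The module name TestUtf8Three and the helper WellFormed are made up for illustration, and the code assumes Utf8ToString is exported from Kernel with res = 0 for ok and res = 2 for a format error.

```component-pascal
MODULE TestUtf8Three;	(* hypothetical module name, for illustration only *)
	IMPORT Kernel, StdLog;

	(* allowed second byte of a 3-byte sequence, depending on the lead byte *)
	PROCEDURE WellFormed (b0, b1: INTEGER): BOOLEAN;
	BEGIN
		IF b0 = 0E0H THEN RETURN (b1 >= 0A0H) & (b1 <= 0BFH)	(* excludes overlong forms *)
		ELSIF b0 = 0EDH THEN RETURN (b1 >= 80H) & (b1 <= 9FH)	(* excludes surrogates *)
		ELSE RETURN (b1 >= 80H) & (b1 <= 0BFH)
		END
	END WellFormed;

	PROCEDURE Do*;
		VAR in: ARRAY 4 OF SHORTCHAR; out: ARRAY 4 OF CHAR; b0, b1, res, mismatches: INTEGER;
	BEGIN
		mismatches := 0;
		in[2] := 80X; in[3] := 0X;	(* the third byte is always a valid tail byte *)
		FOR b0 := 0E0H TO 0EFH DO
			FOR b1 := 0 TO 0FFH DO
				in[0] := SHORT(CHR(b0)); in[1] := SHORT(CHR(b1));
				(* assumption: same signature and result codes as in luowy's listing *)
				Kernel.Utf8ToString(in, out, res);
				IF WellFormed(b0, b1) # (res = 0) THEN INC(mismatches) END
			END
		END;
		StdLog.String("mismatches: "); StdLog.Int(mismatches); StdLog.Ln
	END Do;

END TestUtf8Three.
```

A converter with only a simple format check would be expected to report a non-zero count here, since it accepts second bytes such as 0A1X after 0EDX.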
Attachments
Utf8TestNew2.txt
(5.69 KiB) Downloaded 444 times
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

DGDanforth wrote:
Ivan Denisov wrote: Please, Doug, set up a vote to choose between luowy's and Josef T.'s versions.

The draft of the options is:

- we should adopt Josef's 15% faster version of the Utf8 converter with a simple format check
- we should adopt WenYing's version of the Utf8 converter with a well-formedness check according to the Unicode 7.0 standard
Are we agreed that the choice is between luowy's (which version?) and Josef's?
Luowy's latest version seems OK and passes the tests. So we can choose between Josef's version and Luowy's version based on Josef's.
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Now, do we need to hold a vote on adopting this feature in general?
The previous vote was for choosing the encoder realization and not for adopting Unicode for Component Pascal identifiers. Am I wrong?
Bernhard
Posts: 68
Joined: Tue Sep 17, 2013 6:56 am
Location: Munich, Germany

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Bernhard »

I have edited/deleted the contents of my post; it was a misunderstanding on my side ...
Last edited by Bernhard on Fri Dec 05, 2014 9:43 am, edited 1 time in total.
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Josef Templ »

I have committed the latest changes to the issue-#19 branch.
- Utf8-conversion without checking for illegal Unicode characters
- luowy's optimization for error detection in 3-byte sequences
- Helmut's change of the Kernel.LoaderHook.ThisMod parameter to ARRAY OF CHAR, because this eliminates several Utf8-conversions
The branch is in good shape by now and from my point of view ready for merging.
Ivan Denisov wrote: Now, do we need to hold a vote on adopting this feature in general?
The previous vote was for choosing the encoder realization and not for adopting Unicode for Component Pascal identifiers. Am I wrong?
Right, we need an extra vote. Hopefully a fast one.

- Josef
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Ah, that was not my understanding. The title of the issue is "Unicode for Component Pascal identifiers". The subtle difference between that and "encoder realization" escapes me.
-Doug

Josef Templ wrote: I have committed the latest changes to the issue-#19 branch.
- Utf8-conversion without checking for illegal Unicode characters
- luowy's optimization for error detection in 3-byte sequences
- Helmut's change of the Kernel.LoaderHook.ThisMod parameter to ARRAY OF CHAR, because this eliminates several Utf8-conversions
The branch is in good shape by now and from my point of view ready for merging.
Ivan Denisov wrote: Now, do we need to hold a vote on adopting this feature in general?
The previous vote was for choosing the encoder realization and not for adopting Unicode for Component Pascal identifiers. Am I wrong?
Right, we need an extra vote. Hopefully a fast one.

- Josef
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Doug, please set up the vote!

The version to try is here:
blackbox-1.7-a1.028.zip
blackbox-1.7-a1.028-setup.exe
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19: Unicode for Component Pascal identifiers

Post by DGDanforth »

Ivan,
I can't set up a vote since I don't know what we are voting for!
To me we are finished. We already voted for and passed the issue.
-Doug
Ivan Denisov wrote: Doug, please set up the vote!

The version to try is here:
blackbox-1.7-a1.028.zip
blackbox-1.7-a1.028-setup.exe
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19: Unicode for Component Pascal identifiers

Post by Ivan Denisov »

DGDanforth wrote: I am creating this poll simply trusting and hoping that you all know what you are doing.
Ivan Denisov wrote: We are just choosing between implementations of the Utf8ToString converter in Kernel.
We voted only on the converter. Doug, we need a new vote!

- adopt Unicode for Component Pascal identifiers
- do not adopt Unicode for Component Pascal identifiers
- abstain