Utf8ToString converter for Issue #19

Issue #19: Unicode for Component Pascal identifiers - luowy or Josef's solution

ABSTAIN: 5 votes (45%)
luowy's solution: 2 votes (18%)
Josef's solution: 4 votes (36%)

Total votes: 11

Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Ivan Denisov »

DGDanforth wrote:
Ivan Denisov wrote: http://forum.blackboxframework.org/whod ... php?id=154

Did not vote: OberonCore, ReneK, akastargazer, warnersoft
English translation: Those who have not voted yet.

Thank you, Doug, I have fixed my script: http://forum.blackboxframework.org/whod ... php?id=154
OberonCore
Posts: 31
Joined: Tue Sep 17, 2013 10:30 am
Location: Russia, Orel

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by OberonCore »

There is one more solution. It is more general and can be simplified or modified as needed. It is used, for example, in omcUtf8Conv (http://forum.oberoncore.ru/viewtopic.php?t=4633).

Code:

	PROCEDURE Utf8ToUcs2* (IN inBuf: ARRAY OF SHORTCHAR; VAR inPos: INTEGER; inLen: INTEGER; OUT outBuf: ARRAY OF CHAR; VAR outPos: INTEGER; outLen: INTEGER);
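		VAR	inp, outp: INTEGER; st, ch, c: INTEGER; char: CHAR;
		(* Decoder states held in st:
			8: expecting a lead byte
			1..3: number of continuation bytes (80H..0BFH) still expected
			4..7: range-restricted second byte after lead byte 0E0H, 0EDH, 0F0H, 0F4H
			9..10: skipping 2 or 1 further bytes of a malformed sequence
			0: code point complete in ch (values above 0FFFFH are written as "?")
			11: malformed sequence, write "?"
			31: input or output exhausted *)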
	BEGIN
		ASSERT((0 <= inLen) & (0 <= inPos) & (inPos + inLen <= LEN(inBuf)), 20);
		ASSERT((0 <= outLen) & (0 <= outPos) & (outPos + outLen <= LEN(outBuf)), 21);
		inp := inPos; outp := outPos;
		IF (0 < inLen) & (0 < outLen) THEN st := 8 ELSE st := 31 END;
		LOOP IF (st IN {1..10}) & (0 < inLen) (*& (0 < outLen) *)THEN
			c := ORD(inBuf[inp]); INC(inp); DEC(inLen);
			CASE st OF
			| 8:
				IF c <= 07FH THEN
					ch := c; st := 0
				ELSIF c < 0C0H THEN
					st := 11
				ELSIF c < 0C2H THEN
					st := 10
				ELSIF c <= 0DFH THEN
					ch := ORD(BITS(c) * {0..5}); st := 1
				ELSIF c = 0E0H THEN
					ch := ORD(BITS(c) * {0..4}); st := 4
				ELSIF c <= 0ECH THEN
					ch := ORD(BITS(c) * {0..4}); st := 2
				ELSIF c = 0EDH THEN
					ch := ORD(BITS(c) * {0..4}); st := 5
				ELSIF c <= 0EFH THEN
					ch := ORD(BITS(c) * {0..4}); st := 2
				ELSIF c = 0F0H THEN
					ch := ORD(BITS(c) * {0..3}); st := 6
				ELSIF c <= 0F3H THEN
					ch := ORD(BITS(c) * {0..3}); st := 3
				ELSIF c = 0F4H THEN
					ch := ORD(BITS(c) * {0..3}); st := 7
				ELSE
					st := 11
				END
			| 4:
				IF (0A0H <= c) & (c <= 0BFH) THEN
					ch := ORD(BITS(ASH(ch, 6)) + BITS(c) * {0..6}); st := 1
				ELSE
					st := 10
				END
			| 5:
				IF (080H <= c) & (c <= 09FH) THEN
					ch := ORD(BITS(ASH(ch, 6)) + BITS(c) * {0..6}); st := 1
				ELSE
					st := 10
				END
			| 6:
				IF (090H <= c) & (c <= 0BFH) THEN
					ch := ORD(BITS(ASH(ch, 6)) + BITS(c) * {0..6}); st := 2
				ELSE
					st := 9
				END
			| 7:
				IF (080H <= c) & (c <= 08FH) THEN
					ch := ORD(BITS(ASH(ch, 6)) + BITS(c) * {0..6}); st := 2
				ELSE
					st := 9
				END
			| 1..3:
				IF (080H <= c) & (c <= 0BFH) THEN
					ch := ORD(BITS(ASH(ch, 6)) + BITS(c) * {0..6});
					DEC(st)
				ELSE
					st := 12 - st
				END
			| 9..10:
				INC(st)
			END
		ELSIF st IN {0, 11} (*& (0 < outLen) *)THEN
			IF (st = 0) & (0 <= ch) & (ch <= 0FFFFH) THEN
				char := CHR(ch)
			ELSE
				char := "?"
			END;
			outBuf[outp] := char; INC(outp); DEC(outLen);
			IF (0 < inLen) & (0 < outLen) THEN st := 8 ELSE st := 31 END
		ELSE EXIT END END;
		ASSERT((st IN {1..7, 9..10, 31}) & ((inLen = 0) OR (outLen = 0)));
		IF st IN {1..7, 9..10} THEN outBuf[outp] := "?"; INC(outp); DEC(outLen) END;
		inPos := inp; outPos := outp
	END Utf8ToUcs2;
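
A minimal usage sketch may help when reading the listing above. It assumes that Utf8ToUcs2 is pasted unchanged into the same module; the module name ObxUtf8Demo and the command Do are hypothetical and chosen only for illustration. The input holds one two-byte sequence, one ASCII letter and one stray continuation byte.

```componentpascal
MODULE ObxUtf8Demo; (* hypothetical module name, for illustration only *)

	IMPORT StdLog;

	(* Assumption: the Utf8ToUcs2 procedure from the listing above is pasted here unchanged. *)

	PROCEDURE Do*;
		VAR src: ARRAY 8 OF SHORTCHAR; dst: ARRAY 8 OF CHAR; inPos, outPos: INTEGER;
	BEGIN
		(* 0D0X 0AFX is the UTF-8 encoding of U+042F; 80X is a stray continuation byte *)
		src[0] := 0D0X; src[1] := 0AFX; src[2] := "a"; src[3] := 80X; src[4] := 0X;
		inPos := 0; outPos := 0;
		(* decode 4 input bytes; keep one element of dst free for the terminating 0X *)
		Utf8ToUcs2(src, inPos, 4, dst, outPos, LEN(dst) - 1);
		dst[outPos] := 0X;
		StdLog.String(dst); StdLog.Ln;
		(* number of bytes consumed and number of characters written *)
		StdLog.Int(inPos); StdLog.Int(outPos); StdLog.Ln
	END Do;

END ObxUtf8Demo.
```

Tracing the state machine by hand, the call should consume all 4 bytes and write 3 characters: U+042F, "a", and "?" for the stray byte. Note that the procedure does not terminate outBuf itself, so the caller appends 0X.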
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Now the quorum is reached. So we can stop voting and apply Josef's solution.
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by DGDanforth »

Excuse me for being dense, but I don't believe we ever decided that having a quorum stops the vote.
It is my understanding that a necessary condition for a valid vote is that a quorum of the members has voted.
That doesn't mean the voting is over.

For the current vote, if the last member were to vote for luowy's solution, then we would have a tie.
If at any time the nonvoting members can no longer change the result of a vote, then the voting is stopped whether or not a quorum was reached (short-circuit rule).

So it is my interpretation that the voting has not stopped and that we need one more vote.
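
The short-circuit rule described above can be sketched as a small check. This is only an illustration under stated assumptions: votes already cast are treated as fixed, abstentions count for neither option, and a tie does not count as a decision. The module name ObxVoteDemo and the procedure Decided are hypothetical.

```componentpascal
MODULE ObxVoteDemo; (* hypothetical module name, for illustration only *)

	(* Returns TRUE if the leading option stays strictly ahead even when every
		remaining voter backs the runner-up, i.e. the poll can be stopped early.
		Assumption: cast votes are final and a tie is not a decision. *)
	PROCEDURE Decided* (leader, runnerUp, remaining: INTEGER): BOOLEAN;
	BEGIN
		ASSERT((0 <= runnerUp) & (runnerUp <= leader) & (0 <= remaining), 20);
		RETURN leader > runnerUp + remaining
	END Decided;

END ObxVoteDemo.
```

For example, Decided(4, 2, 1) is TRUE because one outstanding vote cannot close a two-vote gap, while Decided(3, 2, 1) is FALSE because the outstanding vote could still produce a tie.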
Ivan Denisov wrote: Now the quorum is reached. So we can stop voting and apply Josef's solution.
-Doug
Ivan Denisov
Posts: 1700
Joined: Tue Sep 17, 2013 12:21 am
Location: Russia

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by Ivan Denisov »

Doug, I agree with you. We can wait for warnersoft's vote, or some of the members who abstained can change their opinion.
warnersoft
Posts: 3
Joined: Thu Sep 26, 2013 7:35 pm

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by warnersoft »

I've been trying to follow this discussion, but sadly most of it is over my head. Is the concern that a malicious developer could craft an identifier (such as a procedure name) with invalid UTF-8 sequences that could be passed as a procedure parameter and possibly allow execution to branch to malicious code? Or is this simply about allowing development with, for example, Cyrillic characters in identifiers? If the former, then I would vote for the extra code to catch the invalid sequences; if the latter, I would choose the most efficient (fastest) solution.
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Issue #19 : Unicode for Component Pascal identifiers

Post by DGDanforth »

As Josef noted, we now have a short-circuit vote, so the poll is stopped with Josef's solution as the chosen one.