Josef Templ wrote:I cannot read this weird grammar notation.
According to Wikipedia it is a correct Utf-8 sequence
because the tail has the right number of bytes and all tail bytes have high bits 10.
If we implement it the way Wikipedia describes, there can be problems in the future, because people will use this converter for various purposes.
RFC 3629 is today the de facto standard for UTF-8; Wikipedia links to it as well.
The unicode.org FAQ warns about exactly our case and says to "prohibit encoding of certain invalid characters".
http://www.unicode.org/faq/utf_bom.html#utf8-1 wrote:A: UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, "Encoding Forms" and Section 3.9, "Unicode Encoding Forms" in The Unicode Standard. See, in particular, Table 3-6 "UTF-8 Bit Distribution" and Table 3-7 "Well-formed UTF-8 Byte Sequences", which give succinct summaries of the encoding form. Make sure you refer to the latest version of the Unicode Standard, as the Unicode Technical Committee has tightened the definition of UTF-8 over time to more strictly enforce unique sequences and to prohibit encoding of certain invalid characters. There is an Internet RFC 3629 about UTF-8. UTF-8 is also defined in Annex D of ISO/IEC 10646. See also the question above, How do I write a UTF converter?
Ivan, instead of pointing us to a ton of heavy-weight documents,
please tell us what the problem is in simple words.
Currently I don't have the time to figure this out in detail.
I found the following specification for the implementation of a Utf8-to-String converter:
None of the UTFs can generate every arbitrary byte sequence.
For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed by a byte of the form 10xxxxxx₂.
A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated.
When faced with this illegal byte sequence while transforming or interpreting,
a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error:
for example, either signaling an error, filtering the byte out,
or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER).
In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.
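The recovery rule quoted above can be sketched as follows. This is only an illustrative Python sketch (the actual converter is in Component Pascal, and the function name is mine, not from the thread): on an illegal byte it emits U+FFFD and continues scanning at the very next byte, exactly as the FAQ describes.

```python
# Illustrative sketch of the FAQ's recovery rule, limited to the 16-bit
# characters this thread cares about. Not the actual BlackBox converter.
REPLACEMENT = '\uFFFD'

def utf8_to_string(data: bytes) -> str:
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                              # 0xxxxxxx: 1-byte sequence
            out.append(chr(b)); i += 1
        elif 0xC0 <= b < 0xE0:                    # 110xxxxx: expect one 10xxxxxx tail
            if i + 1 < n and data[i + 1] & 0xC0 == 0x80:
                out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
                i += 2
            else:                                  # illegal: emit FFFD, resume at next byte
                out.append(REPLACEMENT); i += 1
        elif 0xE0 <= b < 0xF0:                    # 1110xxxx: expect two 10xxxxxx tails
            if (i + 2 < n and data[i + 1] & 0xC0 == 0x80
                          and data[i + 2] & 0xC0 == 0x80):
                out.append(chr(((b & 0x0F) << 12)
                               | ((data[i + 1] & 0x3F) << 6)
                               | (data[i + 2] & 0x3F)))
                i += 3
            else:
                out.append(REPLACEMENT); i += 1
        else:                                      # lone tail byte, or lead beyond 16 bits
            out.append(REPLACEMENT); i += 1
    return ''.join(out)
```

For the FAQ's example sequence <110xxxxx₂ 0xxxxxxx₂>, this emits FFFD for the dangling lead byte and then decodes the second byte normally.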
Here is my solution which obeys the specification above:
Helmut, your version has a serious bug.
It may report truncation in the case of a format error.
Also there is no decent recovery after a format error.
I tried to avoid the problem of recovery after a format error
by returning immediately.
Ivan, if your concern is about using 'invalid' 16-bit Unicodes, this can be ignored.
Component Pascal does not do any checks when converting a character code to a CHAR.
Why should the Utf8 converter be more strict than the Component Pascal language?
I am convinced that it is enough for us to follow the definition in Wikipedia
and do a simple and efficient conversion for all 16-bit characters.
Note that the conversion is used heavily by the compiler. You don't want to
introduce meaningless checks there and slow down the compiler.
Helmut, your version has another drawback.
It optimizes the error detection instead of optimizing the successful cases.
Look at "ELSIF ch < 0C0X". In my version this is only executed if a 2-byte sequence is detected.
In your version this is always executed and thereby slows down the conversion of 3-byte sequences.
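The branch-ordering point can be illustrated with a small sketch (Python, names mine; the thread's actual code is Component Pascal and uses `ELSIF ch < 0C0X`): dispatch on the lead byte so that the test which catches illegal tail bytes only runs inside the 2-byte branch, instead of being evaluated for every character.

```python
# Illustrative sketch of lead-byte dispatch ordering, not the actual module.
def classify(lead: int) -> int:
    """Sequence length implied by a lead byte (0 = illegal lone tail byte)."""
    if lead < 0x80:
        return 1          # ASCII fast path, tested first
    elif lead >= 0xE0:
        return 3          # 3-byte sequences never reach the next test
    elif lead >= 0xC0:
        return 2          # only here does the 0xC0 boundary get checked
    else:
        return 0          # 10xxxxxx: a tail byte is illegal as a lead
```

With this ordering the common 1-byte and 3-byte cases pay no cost for the error check; reversing the tests would run the boundary comparison on every single character.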
To summarize the current state, IMHO neither Ivan's nor your version is
an improvement over my version.
Josef Templ wrote:Component Pascal does not do any checks when converting a character code to a CHAR.
Why should the Utf8 converter be more strict than the Component Pascal language?
NO!! This converter is exported, so it can be used not only for BlackBox-internal work but also in module Strings and for hundreds of other tasks. UTF-8 is everywhere, so we should think about this.
Ivan, please keep it simple. You are inventing problems that don't exist.
Our Utf-8 converters convert any 'valid' Component Pascal string into Utf-8 AND back.
In Component Pascal a string is valid if it is 0X terminated.
Since Component Pascal does not restrict the character codes, why should the Utf-8 converter?
With your approach, some strings would have a legal StringToUtf8 conversion that
cannot be converted back by Utf8ToString. This is really strange,
and the alternative is so obvious and so simple.
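The round-trip argument can be sketched like this (an illustrative Python sketch under the assumption stated in the post, namely that every 16-bit code 0..0FFFFH is encoded without validity checks; the function name is mine, not the module's):

```python
# Illustrative sketch: encode ALL 16-bit character codes, including 'invalid'
# ones such as surrogates, so that decoding always reproduces the input.
# Not the actual BlackBox StringToUtf8.
def string_to_utf8(s: str) -> bytes:
    out = bytearray()
    for ch in s:
        c = ord(ch)
        if c < 0x80:                       # 1 byte: 0xxxxxxx
            out.append(c)
        elif c < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (c >> 6))
            out.append(0x80 | (c & 0x3F))
        else:                              # 3 bytes, no check for surrogates etc.
            out.append(0xE0 | (c >> 12))
            out.append(0x80 | ((c >> 6) & 0x3F))
            out.append(0x80 | (c & 0x3F))
    return bytes(out)
```

Because no code is rejected, every string that goes in comes back unchanged; a stricter encoder would break this round trip for exactly the 'invalid' codes Component Pascal itself allows in a CHAR.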
I decided to delete all error detection except buffer overflow (res = 1).
Because:
- Format errors inside identifiers do not occur
- The procedure is much faster
- The procedure is easier to understand
Back to the roots is the best solution.