issue-#9: adding module Characters → #19

Merged to the master branch
Bernhard
Posts: 68
Joined: Tue Sep 17, 2013 6:56 am
Location: Munich, Germany

Re: Feature #9: adding module Characters

Post by Bernhard »

Josef,

thanks a lot for the clarification.

I have a slight problem understanding the difference between ARRAY OF BYTE and ARRAY OF SHORTCHAR. I do not remember whether the special role of a formal parameter of type ARRAY OF BYTE, being compatible with any argument, is retained in Component Pascal, if that is what you mean.

But the whole discussion throws us into the problems of a universal character set, and I realized when re-reading the language report that CHAR is also limited to the 16-bit range of UCS-2, which cannot represent the complete Unicode character set. As far as I know, A2/Aos stores UTF-8 on disk/files but maps it to UCS-4 (32 bit) in memory, thereby avoiding the problems with string length and allocation/length differences. I personally dislike UTF-8 coding in memory, since the size requirements of a string can be much larger than its number of characters.
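
To make the size argument concrete, here is a minimal sketch in Python (chosen only for illustration; it is not BlackBox code). The helper name `sizes` and the sample strings are assumptions of this example. It counts code points and the bytes needed in UTF-8, UTF-16, and UTF-32 for the same text.

```python
# Compare the storage size of the same text in three Unicode encodings.
# The "-le" codec variants are used so that no byte order mark is added.

def sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes, UTF-32 bytes)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
        len(text.encode("utf-32-le")),
    )

samples = [
    "Oberon",        # ASCII only
    "\u03c0",        # Greek small letter pi
    "\u65e5\u672c",  # two CJK characters from the 16-bit range
    "\U00013000",    # an Egyptian hieroglyph, outside the 16-bit range
]

for s in samples:
    print(repr(s), sizes(s))
```

In this sketch the UTF-8 byte count differs from the character count as soon as a non-ASCII character appears, and the last sample needs two 16-bit units in UTF-16, which is the case a 16-bit CHAR cannot hold in one element.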

Allowing identifiers to be a UTF-8 encoded ARRAY OF SHORTCHAR seems to be a solution.

I fear people from the Far East could expect us to support their characters as well, although a Chinese colleague assured me that the difficulty and ambiguity of entering such characters far outweigh anything gained from it. So what should we do?

--
Bernhard
Josef Templ
Posts: 2047
Joined: Tue Sep 17, 2013 6:50 am

Re: Feature #9: adding module Characters

Post by Josef Templ »

bernhard wrote:Josef,

I fear people from the Far East could expect us to support their characters as well, although a Chinese colleague assured me that the difficulty and ambiguity of entering such characters far outweigh anything gained from it. So what should we do?

--
Bernhard
16 bit is enough. This representation is also used in Java.
The 16-bit subset of Unicode (UCS-2) already contains a lot of characters from
the Far East, including Hiragana, Katakana, Hangul, and unified CJK (Chinese, Japanese, and Korean).

The code points above that range cover, for example, ancient scripts such as Maya or Egyptian hieroglyphs,
or extended CJK. A large part is unused.
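
As a rough check of which characters fit into 16 bits, the following Python sketch (illustrative only; the helper `utf16_units` and the chosen sample characters are assumptions of this example) reports the code point of each sample and how many 16-bit units UTF-16 needs for it.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to store one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

examples = {
    "Hiragana A": "\u3042",
    "Katakana A": "\u30a2",
    "Hangul syllable": "\ud55c",
    "Unified CJK": "\u4e2d",
    "Egyptian hieroglyph": "\U00013000",
    "CJK Extension B": "\U00020000",
}

for name, ch in examples.items():
    fits = ord(ch) <= 0xFFFF
    print(f"{name}: U+{ord(ch):04X}, fits in 16 bit: {fits}, UTF-16 units: {utf16_units(ch)}")
```

The first four samples occupy a single 16-bit unit; the last two lie above U+FFFF and need a pair of units.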

- Josef
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Feature #9: adding module Characters

Post by DGDanforth »

Ivan Denisov wrote:
DGDanforth wrote:If identifiers are encoded with UTF-8 then there needs to be a wrapper around such
sequences of "characters" that knows how to deal with the UTF-8 encoding.
How is that done when the number of bytes for a character exceeds 1?
I think, that Josef gave the right explanation and answer for your question:
Josef Templ wrote:UTF-8 encoded ARRAY OF BYTE that is stored within an ARRAY OF SHORTCHAR
Yes, a little more thought on my part shows that it does not matter what the encoding is for an identifier. One only needs to compare arrays of bytes. Symbol table lookup need only test for length equality and, if true, then test for byte equality.

The encoding only matters when the array is displayed.
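
A small Python sketch of that idea, with a hypothetical symbol table (`symbols`, `declare`, and `lookup` are names invented for this example and are not taken from the BlackBox compiler): the table is keyed by the raw UTF-8 bytes, and decoding happens only when a name is printed.

```python
# Hypothetical symbol table keyed by the raw UTF-8 bytes of each identifier.
symbols = {}

def declare(name_utf8: bytes, info: str) -> None:
    symbols[name_utf8] = info

def lookup(name_utf8: bytes):
    # Hashing and equality operate on bytes only; nothing is decoded here.
    return symbols.get(name_utf8)

declare("\u03c0".encode("utf-8"), "CONST")   # identifier consisting of Greek pi
declare(b"count", "VAR")

print(lookup(b"\xcf\x80"))   # the two bytes that encode pi in UTF-8
print(lookup(b"missing"))

for raw, info in symbols.items():
    # The encoding matters only here, when the name is displayed.
    print(raw.decode("utf-8"), info)
```

One caveat: Unicode allows some characters to be written as different code point sequences that look the same on screen, and a pure byte comparison treats those as distinct identifiers.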

-Doug
cfbsoftware
Posts: 204
Joined: Wed Sep 18, 2013 10:06 pm

Re: Feature #9: adding module Characters

Post by cfbsoftware »

DGDanforth wrote:Symbol table lookup need only test for length equality and, if true, then test for byte equality.
Unless you are storing the length separately it might be quicker just to compare the bytes as you can stop at the first mismatch. You always have to scan ALL of the bytes just to measure the length.
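
A sketch of the single-pass comparison in Python (illustrative only; `equal_terminated` is a name made up for this example, and it assumes both buffers really contain a 0X terminator):

```python
def equal_terminated(a: bytes, b: bytes) -> bool:
    """Compare two 0X-terminated buffers in one pass.

    Stops at the first mismatch, so the length is never measured separately.
    Assumes each buffer contains at least one zero byte.
    """
    i = 0
    while True:
        x, y = a[i], b[i]
        if x != y:
            return False   # first mismatch, no need to look further
        if x == 0:
            return True    # both terminators reached at the same index
        i += 1

# Bytes after the terminator are ignored, as in a fixed-size name buffer.
print(equal_terminated(b"alpha\x00\x00\x00", b"alpha\x00zz"))
print(equal_terminated(b"alpha\x00", b"alphabet\x00"))
```

If the length is stored next to the bytes, comparing lengths first remains a cheap early exit; the loop above is for the case where it is not.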
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Feature #9: adding module Characters

Post by DGDanforth »

cfbsoftware wrote:
DGDanforth wrote:Symbol table lookup need only test for length equality and, if true, then test for byte equality.
Unless you are storing the length separately it might be quicker just to compare the bytes as you can stop at the first mismatch. You always have to scan ALL of the bytes just to measure the length.
If I understand Helmut's implementation correctly, then an ARRAY OF SHORTCHAR is not a null-terminated string. It is a fixed-length array of bytes. Helmut, is it 0X-terminated or is it fixed length?
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Feature #9: adding module Characters

Post by DGDanforth »

I believe I can answer my own question.
I believe ASCII is a subset of UTF-8.
I believe 0X is an element of ASCII.
Hence null-terminated strings are a valid representation in UTF-8.
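
This reasoning can be checked with a short Python sketch (illustrative only; `store` and `load` are hypothetical helpers for this example, not BlackBox procedures). It relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so a zero byte can only come from the character 0X itself.

```python
def store(text: str, size: int) -> bytes:
    """Put text as UTF-8 into a fixed-size buffer, padded with 0X."""
    raw = text.encode("utf-8")
    if 0 in raw:
        raise ValueError("text must not contain 0X")
    if len(raw) >= size:
        raise ValueError("buffer too small for text plus terminator")
    return raw + bytes(size - len(raw))

def load(buf: bytes) -> str:
    """Read back the text up to the first 0X."""
    return buf[:buf.index(0)].decode("utf-8")

# ASCII text has the same bytes in ASCII and in UTF-8.
print("Kernel".encode("ascii") == "Kernel".encode("utf-8"))

buf = store("\u03c0 r\u00b2", 16)   # pi, space, r, superscript two
print(buf)
print(load(buf) == "\u03c0 r\u00b2")
```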

Please bear with me as I think through all of this.
I am not very interested but know it is necessary and needed.
I am not particularly happy with variable length characters but
since they have been used since 1993 it shows I am out of date
by at least 20 years.

-Doug
Bernhard
Posts: 68
Joined: Tue Sep 17, 2013 6:56 am
Location: Munich, Germany

Re: Feature #9: adding module Characters

Post by Bernhard »

hmm, I'm slightly confused/unsettled.

To experiment with it, I fetched Helmut's latest version from http://www.zinnamturm.eu/pac/B2014.0831.zip, including his change log (decoded from http://www.zinnamturm.eu/pac/B2014.0831.txt), unpacked it to a separate directory, and tried to enter a Greek Unicode pi (found in charmap.exe as Unicode code point U+03C0 = π).

When I cut and paste this character from charmap into that BlackBox instance, I get a strange-looking view (cut'n'pasted as attachment), which is encoded below.

I cannot find any module Characters. What am I doing wrong? Where is my error?
--
Bernhard

StdCoder.Decode ..,, ..iE....3Qw7uP5PRPPNR9Rbf9b8R79FTvMf1GomCrlAy2xhX,Cb2x
hXhC6FU1xhiZiVBhihgmRiioedhgrZcZRiXFfaqmSrtuGfa4700zdGrr8rmCLLCJuyKtYcZRiX
7.2.s,sg5.,6.5Qw7uP51QCPuP7PNN9F9vQAy1xB.gdj,UBxhYhAbf9P0G2sIdvPZntgcghghZ
cZRC8T0E.k.C.H.nj.E.cUGpmWLuOpoKqvCbHZiYpedhA704TeKKw.bHfEWUmL.6..D.5h.E.C
cIhgsNHT9N9ntQ8qorG4704D.CbB,708T1U.kKl5kG1.,6.EAF.86.QC18RdfQHfMf9R9vQ7ON
b17.,.N,,.z.,6.M.,.3OMdPMRvNWMP9.2UEC.c.M.3gwP.0..w26.M.,.PuI,tIFPNN9P,7FN
vN,dA0S45.2UE04.yzG1z48ssHpmsETfPdfQT9PNPNZvQRtIdnVGLtmKWKqtCK.4D..umVyKrG
5EWKqtCK.Q6AA.cQ...sQR,.G20EtV.UIU.U76.0E..k.8ssH38pumqm8rtumdcIf9PY62Ulb8
.CLL8pumqmY62UmT.6.QJw.Qo.E.0U10.bf9bWHZitZhZZcZRC,Mw.ELMSN12Umz.6..F.p0,6
.IE.EL4Iu.6F6.G.0..676.16.6.665hKE.SoA5UTyB4.4.0E.cUZT1E..UO.,.1.eWUbFE.0t
.U..61lbAUgQnPt0lLU8ssH2.Cor..626..U6U..HE.6UjuQmmECe.az86Utj00khWagaYM04N
,...
--- end of encoding ---
Attachments
greek pi cut'n'pasted: Pi.JPG (9.44 KiB)
DGDanforth
Posts: 1061
Joined: Tue Sep 17, 2013 1:16 am
Location: Palo Alto, California, USA

Re: Feature #9: adding module Characters

Post by DGDanforth »

Bernhard,
I believe the rendering of a character is not (completely) specified in Unicode or UTF-8; that is left up to the browser or editor.

"Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters."

"Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode "replacement character" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters."

"All W3C recommendations have used Unicode as their document character set since HTML 4.0. Web browsers have supported Unicode, especially UTF-8, for many years. Display problems result primarily from font related issues; in particular, versions of Microsoft Internet Explorer do not render many code points unless explicitly told to use a font that contains them."

"Free and retail fonts based on Unicode are widely available, since TrueType and OpenType support Unicode. These font formats map Unicode code points to glyphs."

From the BB documentation on cross-platform issues: "Furthermore, you may want to use fonts which are available on both platforms, either TrueType fonts or PostScript fonts if you have Adobe's Type Manager available. If you don't have equivalent fonts on both platforms, you can still read the text, with another font substituted for the correct one. This will likely result in visually unsatisfactory results.
Use the default font for cross-platform documents, if you have no preference for a specific font. Typically, the default font is used for on-line documentation. The default font automatically adapts to the reader's system configuration, so that he or she can select their preferred font."

-Doug
Zinn
Posts: 476
Joined: Tue Mar 25, 2014 5:56 pm
Location: Frankfurt am Main

Re: Feature #9: adding module Characters

Post by Zinn »

1. Cut and paste has nothing to do with the character representation of the identifiers.
It is a completely different topic. You may have the same problem with 1.6.
I can cut and paste the letter π.

2. Module Characters has been moved into Kernel and Strings.
See the beginning of this discussion or point 75. in the description.

3. What do you want?
- Would you like to use Cyrillic & Greek letters in identifiers?
- Do you want to use both character sets in the same project?
- Would you like to use all letters right now or wait another 5 years?

4. Which way would you like to go?
- Using SHORTCHAR via Code Page (see point 5. below)
- Using SHORTCHAR via Utf8 (current solution)
- Using CHAR via Unicode (see point 6. below)

5. You can use the code page version by replacing the procedures
- Kernel.ToUtf8 with Kernel.ToShort
- Kernel.FromUtf8 with Kernel.ToLong
in all modules, then compiling and linking them. This takes about 15 minutes.
You will find the procedures ToShort and ToLong in the deleted module Characters.

6. You can download a CHAR version at
http://forum.oberoncore.ru/viewtopic.php?p=71322#p71322
bb16uni.zip

7. The symbol file of my version (UTF-8) is compatible with 1.6 when only the
characters 'A'..'Z', 'a'..'z', '0'..'9', '_' are used.
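
The difference between the code page route and the UTF-8 route in points 4, 5, and 7 can be illustrated with a Python sketch. This is not the Kernel code: Python's built-in codecs stand in for ToShort and ToUtf8, the Greek code page `cp1253` is an arbitrary choice for the example, and the helper names are invented here.

```python
def to_code_page(name: str, code_page: str = "cp1253") -> bytes:
    """One byte per character, but only for characters the code page contains."""
    return name.encode(code_page)

def to_utf8(name: str) -> bytes:
    """One to four bytes per character, for any Unicode character."""
    return name.encode("utf-8")

pi = "\u03c0"    # Greek small letter pi
zhe = "\u0436"   # Cyrillic small letter zhe

print(len(to_code_page(pi)), len(to_utf8(pi)))         # code page vs UTF-8 size
print(to_code_page("Count_1") == to_utf8("Count_1"))   # ASCII-only name: same bytes

try:
    to_code_page(zhe)        # Cyrillic is not part of the Greek code page
except UnicodeEncodeError as err:
    print("not representable in cp1253:", err.reason)

print(to_utf8(pi + zhe))     # UTF-8 can mix both scripts in one identifier
```

In this sketch an identifier restricted to letters, digits, and the underscore has identical bytes under both encodings, while mixing Greek and Cyrillic in one name works only with UTF-8.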
Bernhard
Posts: 68
Joined: Tue Sep 17, 2013 6:56 am
Location: Munich, Germany

Re: Feature #9: adding module Characters

Post by Bernhard »

Zinn wrote:1. Cut and paste has nothing to do with the character representation of the identifiers.
It is a completely different topic. You may have the same problem with 1.6.
I can cut and paste the letter π.
OK. But I have to find a way to enter Unicode characters into a document, and cut'n'paste is my approach.

First problem solved: my mistake, I should have used Edit->Paste Special->Unicode Text.

But now I have a constant π which is not accepted by the compiler (see the enclosed StdCoder-encoded view with folds expanded).

StdCoder.Decode ..,, ..1L....3Qw7uP5PRPPNR9Rbf9b8R79FTvMf1GomCrlAy2xhX,Cb2x
hXhC6FU1xhiZiVBhihgmRiioedhgrZcZRiXFfaqmSrtuGfa4700zdGrr8rmCLLCJuyKtYcZRiX
7.2.s,6G9.0k,5TWyql.bnayKmKKqGomC5XzET1.PuP.MHT9N9ntumaU2,CJuyKtQC98P9PP7O
NbXmb.2.om2k2E5H.,6.cUGpmWLuOpoKqvCbHZiYpedhA704TeKKw.bHfEWUmL.6..D.l7,6.C
cIhgsNHT9N9ntQ8qorG4704D.CbB,708T1U.EV8.T.pd.2.,U,Ql,U00.bnUGLu8ro8quGrmCL
WKqtE0E.EQ0.,.p.0.4E.6.JFkns.U.2m,.F,,c.E.E7Gr.YjyC.3Qwb8R7fFT9P7vQRdFT9P7
9F9vQ5X1U.2.e5m.04,6.k.sUZz0E..y.8I.E.Uq.3gwP.0..I16.M.2.J,U.2GE,kz1roozzz
zzX.2UG.Etv.0E.s861M2E.A.1U.G..S40E.EDmK,ow6I.3Qw7ONhvETPPPPMR9N9fQbf9b8RO
3U.Ay2hgq,.RdJ.0Et,,..O.2.f.T.zTHT8Ff8H986dONb9RfePHvMT9N9vCPM1HOHVuHZ8J,N
H19RFvCPM15uHRuIdO1Hc.,yU366v76bd9X7BXNBndAhNBbNBlNCjNCbtCPM1VeITuE98FfeI9
867uPJtCPM0hOEZO1HM0l96p76ZOF18HrN1HcE9uFHeHPM0H.v76POMdHL0poWmIin4ak2WoUm
IeWGMam4KIbGIEGorin4qkWu236J9Hu.oZ2xhBIklbeZlVyKrGLtyKqmqm8rtumdsEdfQN9F9v
Q59.XDJ..oZ1xhiZCU2hgnRg.sEMM.Et...ktu0.Y62Umb.2.Y02.A,,E.0..4E,5TeEdKLqKK
tCLLC3ZORNX2V.AyI,ktuGdKLqKa2V.Iy1U.2.i8S.C80E.QE.sQRtIQeoBjghg2hgnRg.AS.c
9Ajg,0EtT.2.U6UO,,U0CS.c918Rd1EWsM,E0E...7,,M.,6.,E.EECOhU.wcNC.zwPA.A.2U.
E,9D6..EBU.U,.J,U.2m,.,.E4WDN.Ntarm3Wj.Jklbcjlq.5uP..QW.U...F.,.aU.E.TptYZ
VQI,AzJE.nT32UP3BNB7l2WH0...
--- end of encoding ---
Zinn wrote: 3. What do you want?
- Would you like to use Cyrillic & Greek letters in identifiers?
I'd like to test your solution.
Zinn wrote: - Would you like to use all letters right now or wait another 5 years?
Absolutely. But if it does not work for the first letter I try, π, I am frustrated, and I might be better off waiting. The question is: wait for what?
Wait for myself (as part of our group) to decide? That is a strange loop.
Zinn wrote: 4. Which way would you like to go?
- Using SHORTCHAR via Code Page (see point 5. below)
- Using SHORTCHAR via Utf8 (current solution)
- Using CHAR via Unicode (see point 6. below)
I want to look at your solution ...

what is current?

As far as I understand Ivan, your version is the only one which implements the "UTF-8 - SHORTSTRING" approach, labeled by Josef as UTF-8 packed into "ARRAY OF BYTE".
--
Bernhard
Last edited by Bernhard on Thu Sep 25, 2014 10:32 am, edited 1 time in total.