issue-#9: adding module Characters → #19

Merged to the master branch

Re: Feature #9: adding module Characters

Post by Zinn »

Josef, I have made all the changes that you suggested.
I also moved the translation between CHAR and UTF-8 closer to Kernel.
That makes life easier.
Must the view in DevDebug be backward compatible with 1.6?
Should I add a version number and everything else needed for compatibility?
Or can I simply change wr.WriteSString(v.name) to wr.WriteString(v.name), and the corresponding Read too?

Post by Josef Templ »

I don't know what it looks like now.

Anyway, in the previous DevDebug there were a lot of patterns like
Strings.FromUtf8(x, y); out.WriteString(y)

I think this pattern should be refactored into a new private procedure
WriteUtf8(x).
This makes the code simpler to read, makes the differences from the previous version more obvious,
and avoids introducing the auxiliary variable y in a lot of places.
You may also consider using Kernel.NameLen instead of 256.
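
As a rough illustration of the refactoring described above, here is a minimal sketch in Python rather than Component Pascal, so it is an analogy and not DevDebug code. The names `write_utf8` and `out` are hypothetical stand-ins for the proposed private procedure WriteUtf8 and the module's output writer; `io.StringIO` plays the role of the writer.

```python
import io


def write_utf8(out, x: bytes) -> None:
    """Decode a UTF-8 byte string and write it (hypothetical stand-in for WriteUtf8)."""
    out.write(x.decode("utf-8"))


out = io.StringIO()

# Before the refactoring, every call site needed its own temporary:
#   y = x.decode("utf-8"); out.write(y)
# After it, each call site is a single call with no auxiliary variable.
write_utf8(out, "Zürich".encode("utf-8"))
write_utf8(out, b".Open")

result = out.getvalue()
```

The conversion now lives in one place, so a later change to how names are decoded would touch only the helper.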

Regarding the version of RefView:
I would leave it as it is, incl. the WriteSString.
Then it is fully compatible with existing files (rarely used anyway) for ASCII characters.
It still loads when extended ASCII has been used, according to my tests.
Module references that include extended ASCII will not work properly when clicked on.
Instead there will be a file open dialog. Such references are not
meaningful anyway when stored to a file.

- Josef

Post by cfbsoftware »

I am finding it increasingly difficult to follow exactly what is going on here. A change as fundamental and as significant as this requires a more analytical / scientific planned approach than appears to be happening. The requirements need to be firmly established and a technical design document that covers all ramifications of the changes is needed so that a proposed solution can be agreed on. Then, and only then, should any significant development be started.

This is not the first system by any means that has had its character set extensively modified. I have been doing some outside investigation in an effort to evaluate the various approaches that have been / could be used and came across the following informative links:

MFC support for MBCS deprecated in Visual Studio 2013

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

UTF8 Everywhere

Post by Bernhard »

cfbsoftware wrote:
"I am finding it increasingly difficult to follow exactly what is going on here. A change as fundamental and as significant as this requires a more analytical / scientific planned approach than appears to be happening. The requirements need to be firmly established and a technical design document that covers all ramifications of the changes is needed so that a proposed solution can be agreed on. Then, and only then, should any significant development be started."

Absolutely.

At the moment I see what is going on more as a proof of concept, or an experiment with one, i.e. how it could be done.
But I do not know whether Helmut & Josef share my opinion.

Helmut & Josef, are you listening?

--
Bernhard

Post by DGDanforth »

I would think the 'CHAR' type should be abstracted to represent a UTF-8 'character' whose bit representation is dependent upon the actual code point. The SHORTCHAR type would then be deprecated (removed). The LEN of a string would then be the number of CHAR (not the number of bytes).
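
To make the difference between counting characters and counting bytes concrete, here is a small Python sketch. It is only an analogy: Python's `str` happens to count code points, which resembles the LEN behaviour proposed here, and the sample string is an arbitrary choice.

```python
name = "Zürich"                # 6 characters, one of them non-ASCII
utf8 = name.encode("utf-8")    # the UTF-8 byte representation

char_count = len(name)         # counts code points
byte_count = len(utf8)         # counts bytes; "ü" takes two of them
```

Here `char_count` is 6 and `byte_count` is 7, so a length defined in characters and a length defined in bytes diverge as soon as a non-ASCII character appears.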

Now I have to look back and see if that interpretation conflicts with what Josef has written.

Doug

Post by DGDanforth »

Josef said:
"the question if we should name
the conversion procedures to StringToUtf8 and Utf8ToString"

Under my suggested use of CHAR there would be no such explicit conversion. All characters are UTF-8 encoded.
Reading legacy text files would convert automatically to UTF-8 internal representation.

Post by Ivan Denisov »

Doug, Unicode (CHAR) can easily be converted to UTF-8 and vice versa. Deprecating SHORTCHAR is something revolutionary, and I think this should be avoided. Let's keep Component Pascal as it is and think about the Framework.
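
The round trip between Unicode text and UTF-8 bytes can be sketched in a few lines of Python. The helper names `to_utf8` and `from_utf8` are hypothetical and only loosely mirror the StringToUtf8 / Utf8ToString naming discussed earlier in the thread; this is not the framework's API.

```python
def to_utf8(s: str) -> bytes:
    """Encode Unicode text as UTF-8 bytes (hypothetical helper)."""
    return s.encode("utf-8")


def from_utf8(b: bytes) -> str:
    """Decode UTF-8 bytes back into Unicode text (hypothetical helper)."""
    return b.decode("utf-8")


text = "Привет, Zürich"
encoded = to_utf8(text)
round_trip = from_utf8(encoded)
```

For valid input the conversion is lossless in both directions, which is why it can stay in library code without touching the language.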

Post by DGDanforth »

Ivan Denisov wrote:
"Doug, Unicode (CHAR) can be easily converted to UTF8 and vice-versa. Deprecating of SHORTCHAR is something revolutionary and I think this should be avoided. Let's keep Component Pascal as it is and think about the Framework."

The point I am making is that the Component Pascal type "CHAR" would be elevated to an abstract type.
The need for Unicode to handle multiple languages on this planet should be embraced by Component Pascal by generalizing CHAR. To muck around with explicit conversions clutters Component Pascal and makes it less "modern". One can also avoid deprecating SHORTCHAR by simply making it equal to CHAR (whose internal representation can still be 7 bits).

Before we do such a thing to the language we can continue to kludge along with character conversions, as has been suggested, but this whole topic shows the inadequacy of our legacy software.

Also 64 bit systems expose similar problems where extra long integers are needed. If one were to also elevate INTEGER (and all types) to an abstract level then the function SIZE would become meaningless. It could not represent the size of a UTF8 character for they vary depending upon the actual character. Instead of applying SIZE(type) one would need to apply it to variables, SIZE(var) whose value would depend upon the implementation.

I am simply looking for clean abstractions of the issues.

Post by DGDanforth »

I should create a separate topic but will stick with this one for now since it initiated the concept.

When Omic created Component Pascal they included the predeclared procedure SIZE.

"Name      Argument type   Result type   Function
SIZE(T)   any type        INTEGER       number of bytes required by T

SIZE cannot be used in constant expressions because its value depends on the actual compiler implementation."

So SIZE is implementation dependent. As such, its use does not make CP platform independent UNLESS the implementation between platforms is guaranteed to be the same. I don't believe Omic made any such guarantees.
I think SIZE should be moved to SYSTEM.SIZE, indicating its dependence on the implementation.
Also because UTF-8 uses variable byte length characters the size of a character now becomes dependent upon the value of the character rather than on its type.
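
A short Python sketch shows how the encoded size follows the value of the character rather than its type. The four sample characters are arbitrary picks, one for each possible UTF-8 length.

```python
# One sample per UTF-8 length: U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "é", "€", "😀"]

# Map each character to the number of bytes its UTF-8 encoding needs.
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}
```

All four values have the same type, yet they need 1, 2, 3 and 4 bytes respectively, so no single SIZE(type) result could describe them.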

Stepping back to a high level of language design, it seems to me that a language specification should not specify anywhere the number of bytes used for a type. By eliminating such references the language can truly be made platform independent, WHERE part of the language specification is that the results of any computation are identical across platforms. That may lead to suboptimal code on some platforms, BUT the guarantee of identical results more than outweighs (IMHO) that possible inefficiency.

Secondly, it would be the job of the underlying implementation to use the minimal number of bytes necessary to compute a result. That would mean a REAL might actually be expressible in a single byte. How to do such "compression" and not kill execution time is part of the challenge. An INTEGER can also be considered a 0-precision REAL when reals can be compressed. That also suggests that INTEGER and REAL could be combined to be NUMERIC. Conversion (compression or expansion) of NUMERIC values would be done automatically internally.

Now that I have wandered far off base, we can get back to Characters. I just wanted to get those ideas "out there".

Post by Josef Templ »

Ivan Denisov wrote:
"Doug, Unicode (CHAR) can be easily converted to UTF8 and vice-versa. Deprecating of SHORTCHAR is something revolutionary and I think this should be avoided. Let's keep Component Pascal as it is and think about the Framework."

I could not agree more.

- Josef