issue-#9: adding module Characters → #19

Zinn · Post by **Zinn** » Mon Aug 18, 2014 5:32 pm

> If we can add more Unicode support to Strings at the same time, what is the problem?

The module Strings imports the module Math. After merging module Characters into module Strings, the modules Math and Strings must be compiled before Kernel, and the link command becomes DevLinker.Link BlackBox2.exe := Math Strings Kernel$+ Files HostFiles StdLoader ...

Cap renamed to Upper and Low renamed to Lower.

Is that ok for you?

Josef Templ · Post by **Josef Templ** » Mon Aug 18, 2014 8:33 pm

Of course not, Helmut, but I think I understand now what the misunderstanding is.

My approach is to move everything that is needed by module Kernel into module Kernel DIRECTLY.
Kernel then remains the lowest level module.
If anything is needed by module Strings that is already needed by module Kernel,
it is IMPORTED from module Kernel by module Strings. There is absolutely no magic in that.
If anything else is needed by module Strings that is platform dependent, it should be imported from
some platform specific module. Module Kernel can also serve that purpose very well because
it is the ultimate platform dependent module anyway. The introduction of something like HostStrings
looks like overkill to me, in particular because most if not all of the platform dependent stuff is
needed in module Kernel anyway.

- Josef

Josef Templ · Post by **Josef Templ** » Tue Aug 19, 2014 9:20 am

The most current commit to the feature#9 branch contains
a Kernel module that provides the required functionality and an
extended module Strings that uses it.
see https://github.com/BlackBoxCenter/bbcb/ ... 356f071ef6

There is full Unicode support available now for Upper/Lower, Long/Short, etc.,
i.e. not only for a subset of languages (Greek, Cyrillic) but for all of them.
This is possible by using the corresponding WinApi functions directly rather than
re-implementing parts of them manually. The implementation thereby gets a lot simpler.
The common case of ASCII characters is still optimized in Kernel because
calling WinApi functions results in some overhead (approx. factor 8).
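
As an illustration of the ASCII optimization described above, here is a minimal Python sketch (not the actual Kernel code, which is Component Pascal). The function name `upper` is hypothetical, and Python's built-in `str.upper` merely stands in for the WinApi call; the "factor 8" overhead is not modeled.

```python
def upper(ch: str) -> str:
    """Uppercase a single character, with a cheap path for ASCII."""
    code = ord(ch)
    if code < 0x80:
        # Fast path: plain arithmetic, no library call needed.
        if 0x61 <= code <= 0x7A:  # 'a'..'z'
            return chr(code - 0x20)
        return ch
    # Slow path: stand-in for the platform call (assumption, see lead-in).
    mapped = ch.upper()
    # Keep the result a single character; some letters expand,
    # e.g. 'ß' maps to 'SS', so those are returned unchanged.
    return mapped if len(mapped) == 1 else ch
```

The point of the design is that the common case never leaves the fast branch, so the cost of the general Unicode mapping is paid only for non-ASCII input.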

The proposal also avoids the need for the usePage variable and the named constants
that were needed in Characters. Module Kernel adapts itself to the code page that is in use
and exports it in the variable 'acp' (ANSI Code Page).

I cannot test this for anything other than the Windows 1252 code page.
The docu of module Strings is not yet updated.

The proposal follows the naming conventions already in place in module Strings (Upper, Lower, ToUpper, ToLower)
and adds Long, Short, ToLong, ToShort, and some IsXXX functions as proposed in Characters.

There is no change in the compile list or compile order and there are no new files involved.
In particular, there is no module below Kernel.

- Josef

Zinn · Post by **Zinn** » Wed Aug 20, 2014 5:51 am

I don’t like your solution. It solves your aim but it also destroys my work. I don’t know how to explain it. Let me try.

Module Characters consists of 3 parts:
1. Some character stuff (IsLetter, IsCap, IsLow, Cap and Low)
2. Detecting identifiers (IsFirstIdentChar and IsIdentChar)
3. Converting identifiers (LongIndent and ShortIdent)

The task of part 3 is to convert the identifier representation from 16 bit to 8 bit and back. I used the code page approach in this implementation. Another implementation could use UTF-8; maybe that would be a better one. If the symbol file of the compiler used CHAR instead of SHORTCHAR, part 3 would not be needed because the output would be equal to the input.

The task of part 2 is to detect identifiers for the compiler. It is not a common facility for the module Strings.

Part 1 is the stuff which I need for part 2. It defines which characters are allowed in identifiers. Its task is not to define which characters are Unicode letters. The two are the same only if you define that all Unicode letters are allowed in identifiers. At the current state I won’t allow all Unicode letters in identifiers. I overlay Cyrillic and Greek letters to the same short representation which is not good.

That is the reason why I declare module Characters as a private interface.

Josef Templ · Post by **Josef Templ** » Wed Aug 20, 2014 8:46 am

> I don’t like your solution. It solves your aim but it also destroys my work.

IMHO, I added value to your work by generalizing it and making it less intrusive.
I have no personal interest in adding this feature at all, but I followed your
argument that we should also support the eastern community.

> The task of part 2 is to detect identifiers for the compiler. It is not a common facility for the module Strings.

Basically, I reused YOUR design of module Characters, which
will be used in many client modules and which therefore cannot have a private interface.
The existing 'private' interfaces in BlackBox are a pain and we should not add another one.
The naming and the semantics can be discussed, though.

The general purpose character classes could be defined by:
exists: IsUpper
exists: IsLower
IsLetter -> IsAlpha (the term 'letter' is often associated with any printable character, not just alphabetical characters)
new: IsNumeric (for the sake of completeness and convenience; I don't really care about that)
new: IsAlphaNumeric (for the sake of completeness and convenience; I don't really care about that)

The character classes used for CP could be defined by:
IsIdentChar -> IsIdent (there is no 'Char' postfix in other IsXXX, so why should there be one here?)
IsFirstIdentChar -> IsIdentStart (don't know why but seems a bit more readable and expressive to me)

There cannot be any confusion if the docu states explicitly that these latter character classes are related to
identifiers as defined in the CP language.
They are more than pure convenience classes because as you pointed out correctly they
allow us to define the valid characters for CP identifiers at one single place
with guaranteed consistency with all the client modules.
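
To make the "one single place" idea concrete, here is a small Python sketch (the real functions would live in the Component Pascal module Strings). The rule used here, a letter or underscore followed by letters, underscores, or ASCII digits, is an assumption for illustration; which letters CP identifiers should accept is exactly what is being discussed. `is_valid_ident` is a hypothetical client helper.

```python
def is_ident_start(ch: str) -> bool:
    """Hypothetical IsIdentStart: may this character begin an identifier?"""
    # Assumption: any Unicode letter or underscore is allowed.
    return ch == "_" or ch.isalpha()


def is_ident(ch: str) -> bool:
    """Hypothetical IsIdent: may this character appear inside an identifier?"""
    # Only ASCII digits are accepted here (assumption).
    return is_ident_start(ch) or "0" <= ch <= "9"


def is_valid_ident(s: str) -> bool:
    """A client (e.g. a scanner) built purely on the two predicates above."""
    return s != "" and is_ident_start(s[0]) and all(is_ident(c) for c in s[1:])
```

Because every client goes through the same two predicates, narrowing or widening the accepted alphabet is a change in one place only.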

For the changes see https://github.com/BlackBoxCenter/bbcb/ ... 320ace4714

> I overlay Cyrillic and Greek letters to the same short representation which is not good.

I have seen that and it is not yet covered in my proposal.
There is some work needed in finding out if this overlay is required at all.
To me it seems that it is for fixing bugs in other places,
viz. places where Short/Long should be called but is not (yet) called.

> Converting identifiers (LongIndent and ShortIdent)

Sorry, but the names LongIndent/ShortIdent are obviously wrong because they are
in no way related to identifiers; they work on all strings.

- Josef

Ivan Denisov · Post by **Ivan Denisov** » Wed Aug 20, 2014 1:43 pm

Josef said that his aim was to provide "conversion between CHAR and SHORTCHAR".

However, the solution depends on the application of this conversion:
1. If we need this for compilation purposes, it should be included in the Kernel and use fewer platform dependent libraries, and the resulting strings do not need to be readable after conversion. So I suggest using Punycode to convert native CHARs to a technical SHORTCHAR representation. Punycode reflects worldwide experience with the same problem in national URLs. (I do not think that we need to include Unicode support in symbol files, because this can decrease the performance.)

2. The other aim is to provide Unicode tools. All this stuff should be in the Strings module. However, we need to split Strings into Strings and HostStrings, because these solutions are platform dependent.

So, Josef (Goryachev) solution should be moved to HostStrings and hooked into Strings. And we need to include in Kernel a small piece of code implementing the Punycode standard for the technical CHAR to SHORTCHAR conversion.

Josef Templ · Post by **Josef Templ** » Wed Aug 20, 2014 4:06 pm

Ivan, are you suggesting to change the symbol and object file format?

Following Helmut's approach, I tried to avoid that. It is a lot of changes.
Helmut's approach can support the eastern community without changing the
symbol and object file formats. This is a compromise, yes,
but we also have to look at our resources. Helmut's approach
is basically ready, I mean the changes of the affected modules.
Renaming Character.Short to Strings.Short is trivial. The main work
is in identifying the places that benefit from making use of it.
And that would not be affected by my approach, which is actually Helmut's approach
but slightly streamlined and generalized.
And in passing Strings gets better Unicode support.
And Short is a meaningful function for Unicode characters if you
want to process Windows text files on your machine.

- Josef

Ivan Denisov · Post by **Ivan Denisov** » Wed Aug 20, 2014 5:36 pm

> Josef Templ wrote: Ivan, are you suggesting to change the symbol and object file format?

No!

> Ivan Denisov wrote: I do not think that we need to include Unicode support in symbol files, because this can decrease the performance.

That was a reply to Helmut's:

> Zinn wrote: ...First we need the solution to use Unicode inside the object file. This kind of change is very difficult. I won’t do it. I wait for someone else does this job...

> Josef Templ wrote: Following Helmut's approach, I tried to avoid that. It is a lot of changes.

If I understood Helmut right, you misunderstood him.

Second, the use of WideCharToMultiByte should be avoided because the code page concept is deprecated.

> Note: New Windows applications should use Unicode to avoid the inconsistencies of varied code pages and for ease of localization.

Also, if one program contains both Russian text and European letters such as ó ú ü, code page 1251 cannot represent all of them, whereas Punycode can.
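
The limitation can be checked with a short Python sketch. It uses Python's standard `cp1251`, `cp1252`, and `punycode` codecs; the helper names `fits_code_page`, `to_short`, and `to_long` are hypothetical and only mirror the CHAR/SHORTCHAR terminology of this thread, they are not BlackBox functions.

```python
def fits_code_page(text: str, code_page: str) -> bool:
    """True if every character of text exists in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def to_short(text: str) -> bytes:
    """Hypothetical CHAR -> SHORTCHAR conversion via Punycode (RFC 3492)."""
    return text.encode("punycode")


def to_long(data: bytes) -> str:
    """Inverse of to_short: recover the original Unicode string."""
    return data.decode("punycode")


mixed = "имяóúü"  # Russian letters plus accented Latin letters

print(fits_code_page(mixed, "cp1251"))  # Cyrillic page lacks ó ú ü
print(fits_code_page(mixed, "cp1252"))  # Western page lacks Cyrillic
print(to_long(to_short(mixed)) == mixed)  # Punycode round trip
```

The Punycode result consists of ASCII bytes only, so it fits an 8-bit representation regardless of the active code page, at the price of not being human readable.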

DGDanforth · Post by **DGDanforth** » Thu Aug 21, 2014 1:29 am

General comment:
The back and forth between Helmut, Josef, and Ivan is part of the "merge" process of
different development branches. As we see, merging is not easy and (I believe) it cannot be automated.

Zinn · Post by **Zinn** » Thu Aug 21, 2014 4:40 pm

Hello Josef,
I apologize for the words in my last contribution here. Your solution is great. I have it completely running in the CPC-Edition. It makes life much easier. Module Characters is deleted. Thank you for your help.
Helmut
