issue-#9: adding module Characters → #19

Josef Templ · Post by **Josef Templ** » Thu Aug 21, 2014 4:58 pm

thank you Helmut, it was mostly a misunderstanding.
I hope that we can also convince Ivan.

I think I understand his idea now.
He wants to feed punycode into the compiler instead of the
original source code. However, this will confuse the Scanner
and will lead to major problems in the compiler and system.
I am pretty sure that it is much more complicated, if it is possible at all.

- Josef

Ivan Denisov · Post by **Ivan Denisov** » Thu Aug 21, 2014 5:17 pm

Josef Templ wrote:thank you Helmut, it was mostly a misunderstanding.
I hope that we can also convince Ivan.

I think I understand his idea now.
He wants to feed punycode into the compiler instead of the
original source code. However, this will confuse the Scanner
and will lead to major problems in the compiler and system.
I am pretty sure that it is much more complicated, if it is possible at all.

You do not need to convince me. I offered some ideas because you asked me to think about this problem. Let's do it your way; maybe you will come back to punycode later if there are problems with code pages. Let's go on.

Zinn · Post by **Zinn** » Thu Aug 21, 2014 5:48 pm

Josef, I agree that punycode will not work with the compiler, but UTF-8 may work, because the range 0..7F is identical to the ASCII representation that the compiler understands.
In my implementation I have eliminated the procedures Long & Short, which work on a single character, and use only the procedures ToLong & ToShort, which work on strings.
The current implementations of ToLong & ToShort translate between ASCII and Unicode. They can be replaced by implementations that translate between UTF-8 and Unicode. This may solve the code page problem.
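
A minimal sketch of the ASCII-compatibility point above. It is written in Python, not Component Pascal, and does not come from the implementation discussed in this thread; the Cyrillic identifier is a made-up example.

```python
# Sketch only: illustrates why a tool that understands 7-bit ASCII can
# pass UTF-8 through unchanged for the range 0..7F.

ident_ascii = "ToLong"
# Hypothetical non-ASCII identifier ("Строка"), written with escapes.
ident_cyrillic = "\u0421\u0442\u0440\u043e\u043a\u0430"

# For pure ASCII text the ASCII and UTF-8 byte sequences are identical.
ascii_bytes = ident_ascii.encode("ascii")
utf8_bytes = ident_ascii.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# Every byte that UTF-8 produces for a non-ASCII character is >= 0x80,
# so it can never be mistaken for an ASCII character.
cyr_bytes = ident_cyrillic.encode("utf-8")
all_high = all(b >= 0x80 for b in cyr_bytes)
```

Here `same_bytes` and `all_high` are both true, and the six Cyrillic letters occupy twelve bytes.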

Josef Templ · Post by **Josef Templ** » Fri Aug 22, 2014 9:54 am

Helmut, we can support the eastern community very well with
Short/Long on a character basis.
I don't see any need to experiment with further alternatives.

The principle is:
You cannot use SHORT for dealing with anything other than code page 1252 (ISO Latin-1)
unless you restrict yourself to 7-bit ASCII.
You can use Strings.Short for dealing with the currently installed code page.
If it happens to be 1252, no change to SHORT. If it is something different, it is also
possible now but was not possible before.

Moving persistent data that includes national characters between different code pages is
of course not possible. This is the same for Windows text files
that use national characters and should be clear to anybody.
For processing such files, using Strings.Short/Long is the way to do it and we should not
restrict the BlackBox users in any way. After all, BlackBox is a general purpose
programming system and it is not our task to tell them what to do with BlackBox
but to make it as general purpose as possible.
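
The principle and the code page caveat above can be sketched as follows. This is a Python illustration using the standard `latin-1`, `cp1251` and `cp1252` codecs, not BlackBox code. The helper names `short_latin1` and `short_codepage` are hypothetical. Raising an error for characters outside Latin-1 is a choice made for this sketch and is not a claim about what SHORT does in BlackBox.

```python
# Sketch only: one byte per character, with and without code page awareness.

def short_latin1(s: str) -> bytes:
    """Narrow each character to one byte; only works for code points <= 0xFF."""
    return s.encode("latin-1")

def short_codepage(s: str, codepage: str) -> bytes:
    """Narrow each character to one byte using the given code page."""
    return s.encode(codepage)

text = "\u042f"  # Cyrillic capital letter YA

# Latin-1 has no slot for a Cyrillic letter.
try:
    short_latin1(text)
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# A Cyrillic code page maps the same character to a single byte.
cp_bytes = short_codepage(text, "cp1251")
round_trip = cp_bytes.decode("cp1251")

# Reading the same byte under a different code page yields a different
# character, which is why such data cannot be moved between code pages.
misread = cp_bytes.decode("cp1252")
```

The conversion stays character-by-character (one byte in, one character out), but the meaning of the byte depends on the code page in use.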

If code pages are eventually removed from Windows, which in my opinion is far away because
it would break thousands of programs and millions of text files, we have to switch to
a new symbol and object file format. For compactness and simplicity this would
of course use UTF-8 then.

If there is any need for a conversion to or from UTF-8 we can add this as additional
conversion functions. Actually the Xhtml subsystem already uses UTF-8 for
reading .html files. But this is a different topic.

If there is any need for a conversion to or from punycode, we could also add
it as additional conversion functions.

- Josef

Zinn · Post by **Zinn** » Mon Aug 25, 2014 9:49 pm

Josef,

I also have the Utf8 version running.

In the Utf8 version all Unicode characters are displayed correctly.
With the code page version only one code page is displayed correctly.

Should I rename the procedure ToLong to FromUtf8 and the procedure ToShort to ToUtf8 and delete the code page version?

Helmut

DGDanforth · Post by **DGDanforth** » Mon Aug 25, 2014 10:48 pm

Zinn wrote:Josef,

I also have the Utf8 version running.

In the Utf8 version all Unicode characters are displayed correctly.
With the code page version only one code page is displayed correctly.

Should I rename the procedure ToLong to FromUtf8 and the procedure ToShort to ToUtf8 and delete the code page version?

Helmut

What are the consequences of retaining the 'ToLong' and 'ToShort' names (for the Utf8 representations)?

Ivan Denisov · Post by **Ivan Denisov** » Tue Aug 26, 2014 2:47 am

Zinn wrote:Should I rename the procedure ToLong to FromUtf8 and the procedure ToShort to ToUtf8 and delete the code page version?

My opinion is Yes.

Josef Templ · Post by **Josef Templ** » Tue Aug 26, 2014 7:56 am

Conversion to and from UTF-8 is a separate and independent topic.
It MUST NOT be mixed up with code page aware conversion
Short/Long/ToShort/ToLong.

Again, there are many possible encodings for strings, such as
UTF-8, UTF-16, punycode, etc.
Code page aware conversion is only one special kind of conversion.
It enhances the built-in capabilities of BlackBox (SHORT/LONG),
which only work for the ISO Latin-1 code page.
In contrast, Short/Long/ToShort/ToLong work for any installed code page.
Because it enhances the built-in conversions (SHORT/LONG),
which also work on a character-by-character level, it is named Short/Long/ToShort/ToLong.
UTF-8 does not work on a character-by-character level. For a single CHAR it produces
one or more bytes, and the result cannot be reversed on a character-by-character level.
This is a completely different approach.
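
A small Python sketch of the variable-length behaviour described above. The sample characters are chosen for illustration only.

```python
# Sketch only: UTF-8 uses a different number of bytes per character,
# and a single byte of a multi-byte sequence cannot be decoded alone.

samples = {
    "A": "A",              # ASCII letter
    "e-acute": "\u00e9",   # Latin small letter e with acute
    "YA": "\u042f",        # Cyrillic capital letter YA
    "euro": "\u20ac",      # Euro sign
}

# Number of UTF-8 bytes produced for each single character.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# Try to decode only the first byte of a three-byte sequence.
euro_bytes = samples["euro"].encode("utf-8")
try:
    euro_bytes[:1].decode("utf-8")
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False
```

The lengths come out as 1, 2, 2 and 3 bytes, and decoding the lone first byte fails.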

- Josef

Zinn · Post by **Zinn** » Tue Aug 26, 2014 2:57 pm

Josef, sorry, I don’t understand.
First I had a module Characters that kept the complete task separate, and you argued that the task is for common use.
Now I have it in Strings, and after I changed the implementation you argue that the task is too special and that I should not mix it up.
So what now?

No, the conversion to and from UTF-8 is not an independent topic.
It is simply the solution to our problem.
If you do not want it in the Kernel & Strings, then I have to roll back to my original solution.

Of course, other solutions are possible. The first one was code pages.
Now I have a better one, UTF-8, and it does the task better than the code page one.
I’m happy with the UTF-8 solution and I am deleting the code page implementation of (To)Short & (To)Long.
I won’t support code pages at all.

I do not have the problem with the single CHAR and the multiple bytes.
Remember, I don't translate single characters; I translate the whole identifier.
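
A sketch of whole-identifier conversion, again in Python and not taken from the actual Strings or Kernel code. The names `to_utf8` and `from_utf8` only mirror the ToUtf8/FromUtf8 names proposed earlier in the thread; their signatures here are assumptions, and the identifier is a made-up example.

```python
# Sketch only: converting a complete identifier avoids the
# per-character problem, because the byte sequence is decoded as a whole.

def to_utf8(identifier: str) -> bytes:
    """Encode a whole identifier as a UTF-8 byte string."""
    return identifier.encode("utf-8")

def from_utf8(data: bytes) -> str:
    """Decode a whole UTF-8 byte string back into an identifier."""
    return data.decode("utf-8")

# Hypothetical mixed identifier: Latin prefix plus a Cyrillic word.
ident = "Get\u0414\u043b\u0438\u043d\u0430"

encoded = to_utf8(ident)
decoded = from_utf8(encoded)
is_lossless = decoded == ident
```

The byte string is longer than the identifier (13 bytes for 8 characters), but the round trip on the whole string is lossless.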

- Helmut

Ivan Denisov · Post by **Ivan Denisov** » Wed Aug 27, 2014 3:18 am

This is a lesson for the future. We need to start new features from a description of the problem (a clear task to solve), not from a solution.

I am on Helmut's side, because he is thinking about the problem: Unicode identifiers. Code pages in Characters were a temporary solution. If the UTF-8 solution works, that is great news.

BlackBox Framework Center