issue-#9: adding module Characters → #19

Zinn · Post by **Zinn** » Thu Sep 25, 2014 8:44 am

Bernhard, your program compiles and runs with the CPC Edition
of BlackBox Component Builder 1.7-RC3, built on 31.08.2014.
You can download it from http://www.zinnamturm.eu
This is what I called the current version in my last post.

Here we are discussing whether my changes should be inserted into the
BlackBox Framework Center version or not.
Currently the Framework Center version is the same as BB 1.6 final,
with a new logo under Help -> About BlackBox.

For the cut-and-paste problems, please create another topic.
There are certainly several problems, with different results depending
on the steps you take.

Bernhard · Post by **Bernhard** » Thu Sep 25, 2014 9:40 am

Zinn wrote: Bernhard, your program compiles and runs with the CPC Edition
of BlackBox Component Builder 1.7-RC3, built on 31.08.2014.
You can download it from http://www.zinnamturm.eu
This is what I called the current version in my last post.

I thought I was using exactly this version, but somehow I got mixed up and used the BB Framework Center version. I have no idea why; most probably I unpacked the wrong archive ...

Zinn wrote: Here we are discussing whether my changes should be inserted into the BlackBox Framework Center version or not.

I know and this is the reason why I wanted to look at it.
--
Bernhard

Josef Templ · Post by **Josef Templ** » Thu Sep 25, 2014 10:30 am

Ivan Denisov wrote: We should not use code pages.
Helmut's solution works well and it supports all the cases of Unicode identifiers.
The solution has a few bugs, but they are small (for example, when a module is unloaded the log message is wrong).
I think that we need to include this solution as fast as possible. And if it touches many files, that is not a problem from my point of view.

I did some tests with Helmut's latest version and everything worked fine.
I could not observe a bug when unloading a module so far.
Ivan, could you explain the bug in detail?
The only bug I found so far is the check for MaxIdLen.
This bug is related to the way the conversion to UTF-8 is done.

The conversion to UTF-8 silently truncates the result if it is too long.
While such behavior can also be found in other Strings routines,
it is probably not adequate for UTF-8 conversion.
With a result flag (see Strings.StringToInt for example) it would
be easy to check for MaxIdLen by simply testing for truncation
after the conversion to UTF-8 (res: 0 = OK, 1 = truncated, 2 = format error).
If we follow that pattern, the question also arises whether we should rename
the conversion procedures to StringToUtf8 and Utf8ToString.

- Josef
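
A minimal sketch of the result-flag pattern described above. The module name `SketchUtf8` is hypothetical and this is not Helmut's actual code. The `res` values follow the ones listed in the post. Only 0 and 1 can occur in this direction, and surrogate code units are simply encoded as ordinary 3-byte sequences.

```componentpascal
MODULE SketchUtf8;
	(* Illustrative sketch only; module name and details are assumptions. *)

	PROCEDURE StringToUtf8* (IN in: ARRAY OF CHAR; OUT out: ARRAY OF SHORTCHAR; OUT res: INTEGER);
		VAR i, j, val, need: INTEGER;
	BEGIN
		i := 0; j := 0; res := 0;
		WHILE (in[i] # 0X) & (res = 0) DO
			val := ORD(in[i]);
			(* number of UTF-8 bytes needed for this 16-bit character *)
			IF val < 80H THEN need := 1 ELSIF val < 800H THEN need := 2 ELSE need := 3 END;
			IF j + need >= LEN(out) THEN
				res := 1	(* truncated: stop before writing a partial character *)
			ELSE
				IF need = 1 THEN
					out[j] := SHORT(CHR(val))
				ELSIF need = 2 THEN
					out[j] := SHORT(CHR(0C0H + val DIV 40H));
					out[j + 1] := SHORT(CHR(80H + val MOD 40H))
				ELSE
					out[j] := SHORT(CHR(0E0H + val DIV 1000H));
					out[j + 1] := SHORT(CHR(80H + val DIV 40H MOD 40H));
					out[j + 2] := SHORT(CHR(80H + val MOD 40H))
				END;
				INC(j, need); INC(i)
			END
		END;
		out[j] := 0X	(* the result is always terminated, even when truncated *)
	END StringToUtf8;

END SketchUtf8.
```

With this shape, a caller that cares about MaxIdLen only has to test `res = 1` after the call. A caller that does not care can ignore `res`.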

Zinn · Post by **Zinn** » Thu Sep 25, 2014 11:17 am

Ivan found the error in Compile and Unload.
In DevCompiler.CompileAndUnload, change the line
after Kernel.UnloadMod(mod) to
(* n := DevCPT.SelfName$; *) Strings.FromUtf8(DevCPT.SelfName$, n);
and Compile and Unload then also reports the module name correctly.

Josef Templ · Post by **Josef Templ** » Fri Sep 26, 2014 7:12 am

Zinn wrote: Ivan found the error in Compile and Unload.
In DevCompiler.CompileAndUnload, change the line
after Kernel.UnloadMod(mod) to
(* n := DevCPT.SelfName$; *) Strings.FromUtf8(DevCPT.SelfName$, n);
and Compile and Unload then also reports the module name correctly.

Thanks.
Helmut, do you have any comments on the 'silent truncation' issue?
Or other ideas how to fix the MaxIdLen bug in a cheap way?

- Josef

DGDanforth · Post by **DGDanforth** » Fri Sep 26, 2014 7:27 am

I am still a little hazy on the utility of using UTF-8.
I have for many years used the Symbol typeface for identifiers and the compiler accepts them.
For example 'q' becomes 'θ'.

So the Symbol typeface is accepted by the BB compiler.
If that is true then is there a Cyrillic typeface that is accepted by the BB compiler?
Searching for Cyrillic typeface yields this

What is the relationship between typeface and what the BB compiler will accept?

-Doug

Josef Templ · Post by **Josef Templ** » Fri Sep 26, 2014 7:56 am

DGDanforth wrote: What is the relationship between typeface and what the BB compiler will accept?

-Doug

The compiler ignores the font and other text attributes.
In ETH terminology: "the compiler does an ASCII projection of the text".
This means it ignores all attributes such as font, size, subscript etc.
It only looks at the character codes.
Now we don't use plain ASCII but 16-bit Unicode, which is
already supported by BlackBox text documents.
Thus, the BlackBox compiler now does a 'Unicode projection' of the text
when reading the source code.

Be aware that you can produce unexpected results when using the Symbol font
depending on the character codes that happen to be used for the various glyphs.
Note that this was also the case before Helmut's changes.

For a compact representation of identifiers in symbol files and object files Helmut's compiler
uses the well-known UTF-8 encoding. In addition it uses UTF-8 also as an internal representation of identifiers.
Please have a look at the related Wikipedia article or other introductory materials
for UTF-8, which is a simple but highly sophisticated encoding.
It is sophisticated in the sense that it provides a certain degree of compatibility with
single byte character strings and for plain ASCII there are no changes at all.

- Josef
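
To make the ASCII compatibility concrete, here is a small sketch of how many bytes UTF-8 needs per 16-bit character. The module and procedure names are hypothetical. The byte sequences in the comments are the standard UTF-8 encodings of those code points.

```componentpascal
MODULE SketchUtf8Len;
	(* Illustrative only; module and procedure names are hypothetical. *)

	(* Number of bytes the UTF-8 encoding needs for one 16-bit character. *)
	PROCEDURE Utf8Len* (ch: CHAR): INTEGER;
		VAR n: INTEGER;
	BEGIN
		IF ch < 80X THEN n := 1	(* plain ASCII, e.g. "q" = 71X -> 71 *)
		ELSIF ch < 800X THEN n := 2	(* e.g. Greek theta 03B8X -> CE B8 *)
		ELSE n := 3	(* e.g. euro sign 20ACX -> E2 82 AC *)
		END;
		RETURN n
	END Utf8Len;

END SketchUtf8Len.
```

Note that a "q" displayed in the Symbol font still has the code 71X, so it stays a single byte. Only a real theta character (03B8X) takes two.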

Bernhard · Post by **Bernhard** » Fri Sep 26, 2014 9:10 am

When going back to the Language Report to check what an identifier is, I just realized that allowing Cyrillic and Greek letters in identifiers requires a change to the report. Currently it states clearly (Section 3, Vocabulary and Representation):

The representation of (terminal) symbols in terms of characters is defined using ISO 8859-1, i.e., the Latin1 extension of the ASCII character set. Unicode (16 bit) characters are allowed in string constants only. Symbols are identifiers, numbers, strings, operators, and delimiters.

and

ident = (letter | "_") {letter | "_" | digit}.
letter = "A" .. "Z" | "a" .. "z" | "À".."Ö" | "Ø".."ö" | "ø".."ÿ".
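
Read as a check on character codes, the quoted `letter` rule might look like the sketch below. The module and procedure names are hypothetical and not taken from the compiler sources. A Greek or Cyrillic letter such as 03B8X falls outside every range, which is why the Report text would have to change.

```componentpascal
MODULE SketchIdent;
	(* Illustrative only; mirrors the Latin-1 letter ranges quoted from the Report. *)

	PROCEDURE IsLatin1Letter* (ch: CHAR): BOOLEAN;
	BEGIN
		RETURN (ch >= "A") & (ch <= "Z") OR (ch >= "a") & (ch <= "z")
			OR (ch >= 0C0X) & (ch <= 0D6X)	(* "À" .. "Ö" *)
			OR (ch >= 0D8X) & (ch <= 0F6X)	(* "Ø" .. "ö" *)
			OR (ch >= 0F8X) & (ch <= 0FFX)	(* "ø" .. "ÿ" *)
	END IsLatin1Letter;

END SketchIdent.
```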

Since I have not yet checked the details of Helmut's implementation (due to time constraints), I do not yet know how the above definitions have to be adjusted to account for the changes.

IMHO it is getting complicated. But the definition of a letter in the Language Report above, which already allows letters with diacritical marks from the Latin-1 extension of ASCII, was itself the starting point for these complications.
--
Bernhard

Zinn · Post by **Zinn** » Fri Sep 26, 2014 6:37 pm

Josef Templ wrote:Helmut, do you have any comments on the 'silent truncation' issue?
Or other ideas how to fix the MaxIdLen bug in a cheap way?

Josef, I am still thinking about your questions. I have not yet settled on a preference regarding the silent truncation.
For now I would like to leave it as it is until some other questions are worked out.

As for the MaxIdLen bug, I am searching for the place where the error should be detected.
It has something to do with the variable length of a single character.
The silent truncation could also be the source of this trouble.
Maybe increasing the size of Kernel.Name beyond 256 will solve the problem.

My current question is whether to move the boundary (translation) between CHAR and SHORTCHAR (UTF-8)
into the Kernel or to leave it where it is. The move is not as easy as it looks.

I am thinking about renaming the type Kernel.Name to Kernel.Utf8Name to make the program more readable.

Renaming ToUtf8 to StringToUtf8 and FromUtf8 to Utf8ToString is a good idea.

Josef Templ · Post by **Josef Templ** » Sat Sep 27, 2014 1:21 pm

Helmut, if you have a 'res' parameter that signals truncation, you can leave the
decision on how to deal with truncation to the client.
There is no runtime overhead for detecting truncation in StringToUtf8.
In most if not all places, truncation needs to be checked by the client.
Otherwise you may get a string that is truncated at some arbitrary position,
which is hard to imagine being useful. In most situations you get follow-up
errors at some other place. If a truncated result is useful, the flag can still be ignored.

In the compiler, it doesn't really matter how large MaxIdLen is because
it is very large anyway. Even 1/3 of it is still very large (85).
My only concern is that it should not be an ignored
limitation that leads to unexpected results in the extreme case.
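
The figure of 85 can be reproduced with a constant expression. This is a sketch that assumes a 256-element SHORTCHAR name buffer, as mentioned earlier in the thread, and a worst case of 3 UTF-8 bytes per 16-bit character. The module and constant names are hypothetical.

```componentpascal
MODULE SketchIdLen;
	(* Illustrative only; names and the buffer size are assumptions. *)

	CONST
		MaxIdLen = 256;	(* assumed size of the name buffer, including the terminating 0X *)
		MaxBytesPerChar = 3;	(* worst case for one 16-bit CHAR in UTF-8 *)
		(* identifier length that always fits: 255 DIV 3 = 85 *)
		SafeIdLen* = (MaxIdLen - 1) DIV MaxBytesPerChar;

END SketchIdLen.
```

An identifier longer than this can still fit if most of its characters are ASCII, which is why the limit has to be checked after conversion rather than by counting characters.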

In general, Strings.StringToUtf8 is much more 'general purpose' if it returns
a 'res' parameter that makes it simple to check for truncation.
It also makes it explicit that there is the danger of truncation,
which might otherwise be overlooked by the programmer!

The simplest solution is to modify DevCPS as outlined below. Please note the
definition and usage of MaxIdLen for consistency between DevCPS and DevCPT.

	PROCEDURE Identifier(VAR sym: BYTE);
		VAR i, res: INTEGER; n: ARRAY MaxIdLen OF CHAR;
	BEGIN i := 0;
		(* collect the identifier as 16-bit characters *)
		REPEAT
			n[i] := ch; INC(i); DevCPM.Get(ch)
		UNTIL ~Strings.IsIdent(ch) OR (i = MaxIdLen);
		IF i = MaxIdLen THEN err(240); DEC(i) END ;	(* too many characters *)
		n[i] := 0X; Strings.StringToUtf8(n, name, res); sym := ident;
		IF res = 1 (*truncated*) THEN err(240) END	(* UTF-8 form does not fit into name *)
	END Identifier;

with

	CONST
		MaxIdLen = LEN(DevCPT.Name);

- Josef
