Josef Templ wrote:I cannot read this weird grammar notation.
According to Wikipedia it is a correct Utf-8 sequence
because the tail has the right number of bytes and all tail bytes have high bits 10.
If we implement it the way Wikipedia describes, there can be problems in the future, because people will use this converter for various purposes.
RFC 3629 is today the de facto standard for UTF-8; Wikipedia links to it as well.
The unicode.org FAQ warns about exactly our case and says to "prohibit encoding of certain invalid characters".
http://www.unicode.org/faq/utf_bom.html#utf8-1 wrote:A: UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, "Encoding Forms" and Section 3.9, "Unicode Encoding Forms" in The Unicode Standard. See, in particular, Table 3-6 "UTF-8 Bit Distribution" and Table 3-7 "Well-formed UTF-8 Byte Sequences", which give succinct summaries of the encoding form. Make sure you refer to the latest version of the Unicode Standard, as the Unicode Technical Committee has tightened the definition of UTF-8 over time to more strictly enforce unique sequences and to prohibit encoding of certain invalid characters. There is an Internet RFC 3629 about UTF-8. UTF-8 is also defined in Annex D of ISO/IEC 10646. See also the question above, How do I write a UTF converter?
Ivan, instead of pointing us to a ton of heavy-weight documents,
please tell us what the problem is in simple words.
Currently I don't have the time to figure this out in detail.
I found the following specification for the implementation of a Utf8-to-String converter:
None of the UTFs can generate every arbitrary byte sequence.
For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed by a byte of the form 10xxxxxx₂.
A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated.
When faced with this illegal byte sequence while transforming or interpreting,
a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error:
for example, either signaling an error, filtering the byte out,
or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER).
In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.
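The recovery rule quoted above can be sketched as follows. This is only an illustrative Python sketch (the actual converter is in Component Pascal, and the function name is mine, not from the thread): on an illegal byte it emits U+FFFD and continues scanning at the very next byte, exactly as the FAQ describes.

```python
# Illustrative sketch of the FAQ's recovery rule, limited to the 16-bit
# characters this thread cares about. Not the actual BlackBox converter.
REPLACEMENT = '\uFFFD'

def utf8_to_string(data: bytes) -> str:
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                              # 0xxxxxxx: 1-byte sequence
            out.append(chr(b)); i += 1
        elif 0xC0 <= b < 0xE0:                    # 110xxxxx: expect one 10xxxxxx tail
            if i + 1 < n and data[i + 1] & 0xC0 == 0x80:
                out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
                i += 2
            else:                                  # illegal: emit FFFD, resume at next byte
                out.append(REPLACEMENT); i += 1
        elif 0xE0 <= b < 0xF0:                    # 1110xxxx: expect two 10xxxxxx tails
            if (i + 2 < n and data[i + 1] & 0xC0 == 0x80
                          and data[i + 2] & 0xC0 == 0x80):
                out.append(chr(((b & 0x0F) << 12)
                               | ((data[i + 1] & 0x3F) << 6)
                               | (data[i + 2] & 0x3F)))
                i += 3
            else:
                out.append(REPLACEMENT); i += 1
        else:                                      # lone tail byte, or lead beyond 16 bits
            out.append(REPLACEMENT); i += 1
    return ''.join(out)
```

For the FAQ's example sequence <110xxxxx₂ 0xxxxxxx₂>, this emits FFFD for the dangling lead byte and then decodes the second byte normally.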
Here is my solution which obeys the specification above:
Helmut, your version has a serious bug.
It may report truncation in the case of a format error.
Also there is no decent recovery after a format error.
I tried to avoid the problem of recovery after a format error
by returning immediately.
Ivan, if your concern is about using 'invalid' 16-bit Unicodes, this can be ignored.
Component Pascal does not do any checks when converting a character code to a CHAR.
Why should the Utf8 converter be more strict than the Component Pascal language?
I am convinced that it is enough for us to follow the definition in Wikipedia
and do a simple and efficient conversion for all 16-bit characters.
Note that the conversion is used heavily by the compiler. You don't want to
introduce meaningless checks there and slow down the compiler.
Helmut, your version has another drawback.
It optimizes the error detection instead of optimizing the successful cases.
Look at "ELSIF ch < 0C0X". In my version this is only executed if a 2-byte sequence is detected.
In your version this is always executed and thereby slows down the conversion of 3-byte sequences.
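The branch-ordering point can be illustrated with a small sketch (Python, names mine; the thread's actual code is Component Pascal and uses `ELSIF ch < 0C0X`): dispatch on the lead byte so that the test which catches illegal tail bytes only runs inside the 2-byte branch, instead of being evaluated for every character.

```python
# Illustrative sketch of lead-byte dispatch ordering, not the actual module.
def classify(lead: int) -> int:
    """Sequence length implied by a lead byte (0 = illegal lone tail byte)."""
    if lead < 0x80:
        return 1          # ASCII fast path, tested first
    elif lead >= 0xE0:
        return 3          # 3-byte sequences never reach the next test
    elif lead >= 0xC0:
        return 2          # only here does the 0xC0 boundary get checked
    else:
        return 0          # 10xxxxxx: a tail byte is illegal as a lead
```

With this ordering the common 1-byte and 3-byte cases pay no cost for the error check; reversing the tests would run the boundary comparison on every single character.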
To summarize the current state, IMHO neither Ivan's nor your version is
an improvement over my version.
Josef Templ wrote:Component Pascal does not do any checks when converting a character code to a CHAR.
Why should the Utf8 converter be more strict than the Component Pascal language?
NO!! This converter is exported, so it can be used not only for BlackBox-internal work but also in module Strings and for hundreds of other tasks. UTF-8 is everywhere, so we should think about this.
Ivan, please keep it simple. You are inventing problems that don't exist.
Our Utf-8 converters convert any 'valid' Component Pascal string into Utf-8 AND back.
In Component Pascal a string is valid if it is 0X terminated.
Since Component Pascal does not restrict the character codes, why should the Utf-8 converter?
With your approach, some strings would have a legal StringToUtf8 conversion that
cannot be converted back by Utf8ToString. This is really strange,
and the alternative is so obvious and so simple.
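The round-trip argument can be sketched like this (an illustrative Python sketch under the assumption stated in the post, namely that every 16-bit code 0..0FFFFH is encoded without validity checks; the function name is mine, not the module's):

```python
# Illustrative sketch: encode ALL 16-bit character codes, including 'invalid'
# ones such as surrogates, so that decoding always reproduces the input.
# Not the actual BlackBox StringToUtf8.
def string_to_utf8(s: str) -> bytes:
    out = bytearray()
    for ch in s:
        c = ord(ch)
        if c < 0x80:                       # 1 byte: 0xxxxxxx
            out.append(c)
        elif c < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (c >> 6))
            out.append(0x80 | (c & 0x3F))
        else:                              # 3 bytes, no check for surrogates etc.
            out.append(0xE0 | (c >> 12))
            out.append(0x80 | ((c >> 6) & 0x3F))
            out.append(0x80 | (c & 0x3F))
    return bytes(out)
```

Because no code is rejected, every string that goes in comes back unchanged; a stricter encoder would break this round trip for exactly the 'invalid' codes Component Pascal itself allows in a CHAR.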
I decided to delete all error detection except buffer overflow (res = 1).
Because:
- Format errors inside identifiers do not occur
- The procedure is much faster
- The procedure is easier to understand
Back to the roots is the best solution.