Re: ASCII control codes in sequences of multibyte character sets

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 2 Sep 2013 22:02:21 +0200

They mean the same thing, provided that a byte is an octet (not always
true: historically this was different on some old computers that had
9-bit or 6-bit bytes, but today we are dealing with 10-bit bytes on networks,
and with bytes with variable-size encoding, the size being tunable
depending on reliability factors; even storage media no longer store
bytes with 8 bits only).

So we should see the "byte" as the minimum logical unit of individually
addressable information (and the industry chose to make it 8-bit only
in all modern standards, for interoperability). But octets are still needed
for low-level descriptions of physical protocols.

In my opinion, it is nonsense to speak about "multi-octet" character
encoding, but "multi-byte" is also very fuzzy. TUS (The Unicode Standard)
prefers speaking about encodings that use "code units" (of arbitrary size,
but supporting at least the range of distinct integers that the encoding
needs). Depending on the standards and the level at which they operate, the
terminology changes, but the concepts are basically the same. As almost all
encodings outside the standard UTFs are now legacy and dying, replaced by
the standard UTFs, we should use the terminology defined in TUS, and for
the UCS in ISO 10646, for all encodings.
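
To illustrate the "code unit" terminology, here is a minimal sketch in
Python. The helper name code_unit_counts and the sample string are my own
assumptions for illustration, not something defined in TUS. It counts how
many code units the same text needs in each of the three standard encoding
forms, dividing the encoded length by the code unit size in octets:

```python
# Hypothetical helper (an illustration, not part of any standard):
# count the code units a string needs in each Unicode encoding form.
def code_unit_counts(text):
    return {
        # UTF-8 uses 8-bit code units: one per encoded octet.
        "UTF-8": len(text.encode("utf-8")),
        # UTF-16 uses 16-bit code units: two octets each (no BOM with -le).
        "UTF-16": len(text.encode("utf-16-le")) // 2,
        # UTF-32 uses 32-bit code units: four octets each (no BOM with -le).
        "UTF-32": len(text.encode("utf-32-le")) // 4,
    }

# Four code points: U+0061, U+00E9, U+20AC, U+1D11E.
sample = "a\u00e9\u20ac\U0001D11E"
counts = code_unit_counts(sample)
print(counts)
```

The same four code points take a different number of code units in each
form, which is why counting "bytes" or "octets" alone says little about the
encoding without also naming the code unit size.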

Let's forget "multi-byte" or "multi-octet". "Multi-byte" is just linked to
the old POSIX standard libraries for C/C++ (which in fact should have been
named "multi-char", not "multi-character", for these languages, given the
meaning of "char" in C or C++). "Multi-octet" is used in former ISO
encodings based on 8-bit code units, but specified as the minimum storage
requirement (additional bits may be needed, and these ISO standards do not
specify how they are mapped to a physical encoding space, notably in storage
or in transmission; nor do they give them any numerical value: these
are just "identifiers", or coordinates in an 8-dimensional binary vector
space with arbitrary axes, and they don't have any arithmetic properties
bound to them).

2013/9/2 Doug Ewell <doug_at_ewellic.org>

> Steffen Daode Nurpmeso wrote:
>
>> |If you count fixed length (>1) character sets as multibyte, you can
>> |add UCS-2 and UTF-32.
>>
>> Yes, but no :), i would count those as multi-octet rather than
>> multibyte character sets.
>>
>
> How would you define the difference between multi-octet and multi-byte?
>
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
>
Received on Mon Sep 02 2013 - 15:05:08 CDT
