Re: Counting Codepoints

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 13 Oct 2015 16:16:47 +0200

This works in Java because Java also treats surrogates as characters, even
though it has additional APIs for measuring a string's actual length in
Unicode code points. Outside strings, characters are just integers matching
their code point value, and they are not restricted to valid Unicode
characters (strings are not subject to UTF-16 validation either). Java
strings are not UTF-16 strings; they are just sequences of unsigned 16-bit
code units, with arbitrary values in arbitrary order (so strings that are
ill-formed for Unicode are still valid Java strings).
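
To illustrate, here is a minimal sketch, assuming Java 8 or later. The class
name CodeUnitDemo and the helper hexCodePoints are made up for this example;
only the String and Character calls are standard library APIs.

```java
import java.util.stream.Collectors;

class CodeUnitDemo {
    // Hypothetical helper: lists the code points of s in hex, separated by spaces.
    // A lone surrogate code unit is reported as its own surrogate code point.
    static String hexCodePoints(String s) {
        return s.codePoints()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // A lone low surrogate followed by a valid surrogate pair:
        // ill-formed as UTF-16, yet a perfectly legal Java string.
        String s = "\uDC00\uD800\uDC20";

        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(hexCodePoints(s));                // dc00 10020

        // An int used as a "character" is not validated either:
        // it may lie outside the Unicode range or be a surrogate.
        System.out.println(Character.isValidCodePoint(0x110000)); // false
        System.out.println(Character.isValidCodePoint(0xD800));   // true
    }
}
```

Note that isValidCodePoint only checks the range 0 to 0x10FFFF, so it accepts
surrogate code points; it does not test for Unicode scalar values.
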
When UTF-16 validity is required, your examples with loops would have to
test for lone surrogates among the returned code points. Such detection is
needed when implementing some protocols, e.g. to parse HTML pages and check
(or guess) their encoding; the input stream would then be parsed with
another encoding, which counts code points differently.
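
Such a test could look roughly like the following sketch. The class and
method names are hypothetical; it relies on the documented behavior that
String.codePointAt returns the code unit itself when it is not part of a
valid surrogate pair.

```java
class LoneSurrogateCheck {
    // Hypothetical helper: returns the index of the first lone surrogate
    // code unit in s, or -1 if s is well-formed UTF-16.
    static int firstLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // A result in the surrogate range means the code unit at i is unpaired:
            // a valid pair would have been combined into a supplementary code point.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstLoneSurrogate("\uDC00\uD800\uDC20")); // 0
        System.out.println(firstLoneSurrogate("a\uD800\uDC20"));      // -1
        System.out.println(firstLoneSurrogate("a\uD800"));            // 1
    }
}
```
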
For I/O, the 16-bit "char" type is actually not used: I/O is performed with
signed "byte"s, which are decoded using a specific encoding. Decoding them
into strings can return errors or throw exceptions, and the reverse
operation (encoding) can also fail.
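
As a sketch of both failure modes, assuming UTF-8 as the encoding and using
the java.nio.charset coders with CodingErrorAction.REPORT (the class and
method names are invented for this example):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Demo {
    // Returns true if the bytes decode as UTF-8 without any error.
    static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns true if the string encodes to UTF-8 without any error;
    // a lone surrogate makes the encoder report malformed input.
    static boolean encodesAsUtf8(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ok = {(byte) 0xC3, (byte) 0xA9}; // U+00E9 in UTF-8
        byte[] truncated = {(byte) 0xC3};       // lead byte without its trail byte

        System.out.println(decodesAsUtf8(ok));                    // true
        System.out.println(decodesAsUtf8(truncated));             // false
        System.out.println(encodesAsUtf8("\uD800\uDC20"));        // true
        System.out.println(encodesAsUtf8("\uDC00\uD800\uDC20"));  // false
    }
}
```

By contrast, convenience methods such as String.getBytes(Charset) and
new String(byte[], Charset) do not fail: they silently substitute a
replacement, so strict checking needs the explicit coder objects.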

2015-10-13 14:08 GMT+02:00 Mark Davis ☕️ <mark_at_macchiato.com>:

>
> On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:
>
>> Rather the question must be the unwieldy one of how
>> many scalar values and lone surrogates it contains in total.
>>
>
> That may be the question in theory; in practice no programming language
> is going to support APIs like that. So the question is whether your
> original question was purely theoretical, or was about some particular
> language/environment.
>
> If the latter, then looking at the behavior of related functions in that
> environment, like traversing a string, and counting in a way that is most
> consistent with their behavior, is the least likely to cause problems.
>
> For example, Java is pretty consistent; each of the following returns 2 as
> the count.
>
> String test = "\uDC00\uD800\uDC20";
> int count = test.codePointCount(0, test.length());
> System.out.println("codePointCount:\t" + count);
>
> count = 0;
> int cp;
> for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
>     cp = test.codePointAt(i);
>     count++;
> }
> System.out.println("Java 7 iteration:\t" + count);
>
> count = 0;
> for (int cp2 : test.codePoints().toArray()) {
>     count++;
> }
> System.out.println("Java 8 iteration:\t" + count);
>
> // for the last, could just call:
> // count = (int) test.codePoints().count();
>
> The isolated surrogate code unit is consistently treated as the
> corresponding surrogate code point, which is what anyone would reasonably
> expect.
>
> Mark
>
Received on Tue Oct 13 2015 - 09:18:34 CDT
