From: Henrik Theiling (theiling@absint.com)
Date: Fri Apr 11 2008 - 15:21:07 CDT
Hi!
Kenneth writes:
>...
> Second, any such restriction would have to be written into
> ISO/IEC 10646, as well as the Unicode Standard. I can tell
> you from experience that it was a considerable problem getting
> even the limited constraints now documented to consensus for
> documentation in 10646, and getting that through ballots and
> publication. National Bodies are (justifiably, I think) concerned and
> worried about algorithmic constraints on their ability to
> name things, particularly when the constraints get complicated
> to the point that they can't remember all the details or
> envision being able to check manually for uniqueness.
>...
Ah, I see. Sounds pretty futile. I think I will not propose it then,
because I am not really in the mood for tilting at windmills.
(My IT-influenced brain of course is used to algorithmic constraints
on my ability to name things, so I personally don't see any problem,
but I can understand that other people think differently. :-))
> ... I have run into similar data from another point of view -- in
> examining the Unicode names list for redundancies that allow
> creation of specialized algorithms to pack it down into much smaller
> storage without making use of generic compression algorithms like
> LZW. ...
Actually, that was exactly what I was doing: building a hash table
mapping from hash value of normalised user input to a squeezed format
of full names. That's where the frequency data came from. The data
structure serves a) recognition of user input and b) expansion to
official names.
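
A minimal sketch of such a lookup, under stated assumptions: the
normalisation rule (fold case, drop spaces, underscores and hyphens) is
a simplification chosen for illustration, the name list is a tiny
hand-picked sample, and the helper names normalise, build_index and
expand are hypothetical. The table stores full names rather than a
squeezed format, and Python's dict does the hashing of the normalised
key.

```python
import re

# Tiny hand-picked sample of official names; a real table would cover
# the whole Unicode names list.
OFFICIAL_NAMES = [
    "LATIN SMALL LETTER A",
    "GREEK SMALL LETTER ALPHA",
    "HYPHEN-MINUS",
    "TIBETAN LETTER A",
]


def normalise(name: str) -> str:
    """Fold case and drop whitespace, underscores and hyphens.

    This is a simplified rule assumed for the sketch, not a complete
    implementation of any published matching rule.
    """
    return re.sub(r"[\s_\-]+", "", name.upper())


def build_index(names):
    """Map each normalised key to its official name.

    Raises ValueError if two different official names collapse to the
    same key, because recognition would then be ambiguous.
    """
    index = {}
    for official in names:
        key = normalise(official)
        if key in index and index[key] != official:
            raise ValueError(f"ambiguous key {key!r}")
        index[key] = official
    return index


INDEX = build_index(OFFICIAL_NAMES)


def expand(user_input: str):
    """Return the official name for user input, or None if unknown."""
    return INDEX.get(normalise(user_input))
```

With this sample, expand("latin small-letter a") yields the official
spelling, and unknown input yields None. The collision check in
build_index is the place where an ambiguity introduced by a newer names
list would surface.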
>> For stability reasons, it would be very nice if we knew that upcoming
>> Unicode versions had the same nice unambiguity, because then I could
>> officially ignore those words so my users could enjoy more concise
>> character names.
>
> It is unlikely that the UTC or WG2 will depart significantly from
> the patterns they already have in naming characters. And that
> means that you'd likely be pretty safe in assuming you could
> ignore (and or delete) such redundant terms when doing name
> recognition. ...
The problem is that I am writing a specification for a file format
that must be safe with respect to changes: if my specification or
Unicode is upgraded, I want to guarantee that files written for old
specs are either cleanly rejected (so they can be fixed) or cleanly
accepted with exactly the semantics they were written with. So 'pretty
safe' does not feel safe enough.
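
The requirement can be sketched as follows. Everything here is a
hypothetical illustration: the per-version tables, the version numbers,
and the names RejectedFile, resolve and check_tables_stable are
assumptions, not part of any existing specification. The idea is that
each spec version freezes its name table, a file declares the version
it was written for, and anything outside that table is rejected rather
than guessed at.

```python
class RejectedFile(Exception):
    """Raised when a file cannot be accepted with unchanged semantics."""


# Hypothetical frozen tables, one per spec version.
NAME_TABLES = {
    1: {"LATIN SMALL LETTER A": 0x61},
    2: {"LATIN SMALL LETTER A": 0x61, "GREEK SMALL LETTER ALPHA": 0x3B1},
}


def check_tables_stable(tables):
    """Verify that every newer table keeps all older names unchanged."""
    versions = sorted(tables)
    for old, new in zip(versions, versions[1:]):
        for name, code_point in tables[old].items():
            if tables[new].get(name) != code_point:
                raise AssertionError(
                    f"{name!r} changed between versions {old} and {new}"
                )


def resolve(spec_version: int, name: str) -> int:
    """Resolve a name under the version the file declares, or reject."""
    table = NAME_TABLES.get(spec_version)
    if table is None:
        raise RejectedFile(f"unknown spec version {spec_version}")
    if name not in table:
        raise RejectedFile(f"{name!r} is not defined in version {spec_version}")
    return table[name]


# Run the stability check once, when the tables are loaded.
check_tables_stable(NAME_TABLES)
```

An abbreviation that is only 'pretty safe' would fail this kind of
check as soon as a later version introduced a second name matching it,
which is why the sketch accepts only names that are frozen per version.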
This means that I'll just use the simplifications from TR#34 and try
to be happy with them.
Thanks to you, Mark, and Michael for helping me!
**Henrik