Peter continued:
> OK, both you and John mentioned identifiers. Let me ask a slightly
> different question: I'm thinking about all of our linguists who have
> existing data containing 0x27 to represent a glottal stop (some possibly
> also using it as a quotation mark / apostrophe), and I'm thinking about
> getting them migrating to using Unicode. I know that it would be good for
> them to encode this orthographic representation of glottal stop as U+02BC,
> but if they also use 0x27 for a quotation mark, it may not be so trivial
> to get their data converted correctly, and many might be inclined to just
> map 0x27 > U+0027. I'm trying to think of reasons to give them as to why
> they might not want to do this, and usability for identifiers isn't going
> to particularly grab the attention of many of them.
>
> So, why might a linguist want to go through the extra effort to map 0x27 >
> U+02BC in exactly those contexts when it should map to this and not U+2019
> or something else?

This is just the computer-age version of the age-old question as
to why a linguist would want to distinguish anything that functions
differently.

For years back in the late 70's and early 80's, before I got my
first PC, I typed up index slips with a manual typewriter. That
manual typewriter had various custom keys welded on, so that I could
get schwas, open-o's, lambdas, dead-key commas above, and the like.
To do so, it eliminated various "dispensable" keys. Among the
dispensables were "1" and "0" (odd choices to be missing for a
future computer guy -- but I digress). So my only option was to
type "l" for "1" and "O" for "0". That worked fine for my slips,
because I knew the difference. It also worked fine for correspondence,
although it was a bit hinky. And the post office didn't care for
addresses I typed that way. (They still don't, as a matter of fact.)

But if I took all that data and coded it for entry in a modern
database, would I keep my "l"'s and "O"'s for "1"'s and "0"'s?
Of course not. Because I know the difference, and wouldn't want them
mixed up for computational use.

Now take your linguists. If they have been using ASCII 0x27 for
both a single quote and for glottal stop, they are just the next
step along in overloading functionally different characters.
[In fact, if they've been using Macintoshes all along, they shouldn't
be in this quandary, since they could/should have been using
0xD4/0xD5 (= U+2018/U+2019) for their single quotation marks. If
so, then they would already be distinguished from an 0x27 used for
a glottal stop.] If I were in their shoes and were being offered
a conversion to a character encoding that enabled me to make
the difference systematically, I'd be working to filter my data
to do the right thing -- since I'd care about the data integrity
in the more capable representation. Of course, they might also
prefer to actually make use of U+0294 LATIN LETTER GLOTTAL STOP,
but that would depend, in part, on how entrenched the orthography
using the apostrophe-shaped letters is.
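
As an illustration of what "filtering the data" could look like, here is a
minimal sketch in Python. It rests on a loud assumption that will not hold
for every orthography: a 0x27 with a letter on both sides is taken to be the
glottal stop letter, and every other 0x27 is taken to be a quotation mark.
An orthography with word-initial or word-final glottal stops would need a
different rule, or a word list. The function name convert_apostrophes is
hypothetical.

```python
MODIFIER_APOSTROPHE = "\u02BC"  # MODIFIER LETTER APOSTROPHE (the letter)
LEFT_QUOTE = "\u2018"           # LEFT SINGLE QUOTATION MARK
RIGHT_QUOTE = "\u2019"          # RIGHT SINGLE QUOTATION MARK


def convert_apostrophes(text: str) -> str:
    """Replace each ASCII apostrophe according to its neighbours.

    Assumed rule (hypothetical, not valid for every orthography):
    letter on both sides -> glottal stop letter; otherwise a quote,
    opening if at the start of the text or after whitespace or an
    opening bracket, closing in all remaining cases.
    """
    out = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if prev_ch.isalpha() and next_ch.isalpha():
            out.append(MODIFIER_APOSTROPHE)
        elif prev_ch == "" or prev_ch.isspace() or prev_ch in "([":
            out.append(LEFT_QUOTE)
        else:
            out.append(RIGHT_QUOTE)
    return out and "".join(out) or ""


converted = convert_apostrophes("he said 'ha'a' today")
```

Under that rule the outer pair becomes U+2018/U+2019 and the word-internal
mark becomes U+02BC. Anything the rule cannot decide safely is better
flagged for a human than guessed at.
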
By the way, if identifiers won't grab the attention of your linguists,
then consider a kindred operation: word selection. A properly
implemented word selection should select *inside* quotation marks,
but should include any glottal stops in a word. If your orthography
hopelessly mixes up the two, then your system isn't likely to give very
appropriate word selection feedback.
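
The effect is easy to see in a small Python sketch. It uses the regular
expression class \w as a crude stand-in for real word selection (an
assumption; an actual implementation would use a proper word-boundary
algorithm). U+02BC has General Category Lm, a letter, so \w matches it,
while U+0027 and U+2019 are punctuation and are not matched. The function
name select_word is hypothetical.

```python
import re
import unicodedata


def select_word(text: str, pos: int) -> str:
    """Return the run of word characters covering index pos, or ''."""
    for match in re.finditer(r"\w+", text):
        if match.start() <= pos < match.end():
            return match.group()
    return ""


overloaded = "'ha'a'"                # 0x27 used for quotes AND glottal stop
distinct = "\u2018ha\u02BCa\u2019"   # quotes and letter kept apart

# "Double-click" on the first letter of the word in each version.
picked_overloaded = select_word(overloaded, 1)  # word is cut at the 0x27
picked_distinct = select_word(distinct, 1)      # whole word, quotes excluded

# The General Category values are what drive the difference.
categories = [unicodedata.category(c) for c in "'\u2019\u02BC"]
```

With the overloaded data the selection stops at the glottal stop and returns
only "ha"; with the distinct characters it returns the whole word and leaves
the quotation marks outside.
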
--Ken