transforms and language identifiers (was Re: Dozenal chars in music)

From: Julian Bradfield (jcb+unicode@inf.ed.ac.uk)
Date: Sun May 24 2009 - 17:10:25 CDT

    >Changed the name to better reflect the subject.

    And it's even working back to Unicode, via locales!

    >Well, what can I say? Perhaps I am 'silly', but first, you assume that 'en'
    >is defined as in the Ethnologue; and in this case, we don't; we follow IETF

    That was an assumption which I stated in the first message, on the
    basis that ISO doesn't define it. I admit that I'd never heard of BCPs
    - I'm sure more people know about Ethnologue than BCP 47, by some
    orders of magnitude, so I view best current practice as following
    Ethnologue.

    >It is fine to root for the home team (or English variant), but UK English is
    >not currently the most common form of English. And who knows, at some time
    >in the future en-IN may be the most common form of English.

    The most common is probably already that version of English spoken by
    Chinese learners!

    >First, that is not how locale models work; programming language subclassing
    >is not particularly analogous to the situation. What locale models do is
    >give the user "the best shot". If I ask for en-UK-x-Yorkshire, it gives me
    >the most common variant of Yorkshire if available, otherwise the most common
    >variant of en-UK, otherwise the most common variant of en. That way you get
    >something as close to the request as possible; and your user doesn't just
    >get a 404 <http://en.wikipedia.org/wiki/HTTP_404> if you don't have an exact
    >match. A second part of that locale model is that you can also query (in
    >APIs) what was returned, and decide on that basis if you want to do
    >something special (like tell the user that they aren't getting an exact
    >match, or throw them a 404 <http://en.wikipedia.org/wiki/HTTP_404>). Google
    >"locale inheritance model" for more info.

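    The "best shot" lookup described in the quoted paragraph can be
    sketched roughly as follows. This is only an illustration of
    truncation-style fallback; the function names and the tiny resource
    table are hypothetical, and the tags are compared case-sensitively
    for brevity.

```python
def fallback_chain(tag):
    """Yield the requested tag, then progressively shorter prefixes of it."""
    parts = tag.split("-")
    while parts:
        yield "-".join(parts)
        parts.pop()
        # A single-character subtag such as "x" only introduces what
        # follows it, so it is dropped together with its payload.
        if parts and len(parts[-1]) == 1:
            parts.pop()


def lookup(requested, available):
    """Return (matched_tag, resource), or (None, None) if nothing matches."""
    for candidate in fallback_chain(requested):
        if candidate in available:
            return candidate, available[candidate]
    return None, None


# Hypothetical resource table: no Yorkshire-specific data exists.
resources = {"en": "generic English", "en-UK": "UK English"}
matched, resource = lookup("en-UK-x-Yorkshire", resources)
exact = matched == "en-UK-x-Yorkshire"  # the caller can inspect this
```

    Returning the matched tag alongside the resource is what lets the
    caller decide whether to warn the user or refuse outright.
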
    That's still a covariant situation: the user's asking for an output
    locale, and you're saying "can't do it, here's something that is at
    least a cousin of your requested type, are you happy?". Your
    transformation program is contravariant in its input. It takes input,
    which may be specified as en-UK, or en-SG, or whatever, transforms it
    to the output locale fonipa, and silently *gives the wrong answer* --
    not a "best shot" answer with notification. (An especially unfortunate
    answer, since the GA /fɑks/ sounds to British ears more like RP /fʌks/
    than /fɒks/.) Unless the output "locale" is labelled as being a
    representation of the "en-US" version of the input, the user isn't
    getting the information you claim that the locale models should give -
    and if you do label the output as being a representation of en-US,
    then you might better declare up front that the input locale is en-US.
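
    As a rough sketch of the labelling argued for above: a transform can
    return, along with its output, the variety it actually modelled, so
    that a mismatch with the declared input is visible rather than
    silent. Everything here (the names, the one-word table, the choice of
    en-US as the only model) is an assumption made for illustration.

```python
from dataclasses import dataclass

# Toy pronunciation table. Only a General American model is assumed to exist.
MODELS = {"en-US": {"fox": "fɑks"}}
DEFAULT_MODEL = "en-US"


@dataclass
class Transcription:
    text: str              # the IPA output
    modelled_variety: str  # the variety the output actually represents
    exact: bool            # True only if it matches the declared input


def to_ipa(word, declared):
    """Transcribe one word, reporting which variety was really used."""
    if declared in MODELS:
        return Transcription(MODELS[declared][word], declared, True)
    # No model for the declared variety: fall back, but say so.
    return Transcription(MODELS[DEFAULT_MODEL][word], DEFAULT_MODEL, False)


result = to_ipa("fox", "en-UK")
# result.text is the GA form, and result.exact is False, so the caller
# can tell the user that this is not a transcription of en-UK.
```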

    >Secondly, of course, all language tags are approximations. The code en-UK is
    >not uniform in denotation. If you mean RP, according to the BL, it is spoken
    >by as few as 2% of the UK population (
    >http://www.bl.uk/learning/langlit/sounds/case-studies/received-pronunciation/).

    Of course. But most en-UK speakers accept RP as a reference standard
    pronunciation, although they no longer consider it a normative
    standard. Likewise people accept GA as an American reference standard,
    not a normative standard.

    I think it's not entirely clear whether UK or US English is viewed as
    the reference standard for English, if you're only interested in
    numbers. US clearly dominates the native-English-speaking world, but
    probably many of the L2 English speakers still think of UK English as
    a nearer reference standard than US, especially in those places where
    there are many L2 speakers.

    >You say "If you claim to transform from en to X, your transformation should
    >be correct for anything that is en." If we followed your argument, the
    >transformation for en-UK should be correct for anything that is "en-UK". By
    >that account, one couldn't use "en-UK" either in mapping to IPA, since it is
    >not completely determinant; it means any of the variants of English as
    >spoken in the UK. If we followed your logic to the bitter end, we'd have to
    >specify down to the very narrow dialect, maybe even idiolect. That's simply
    >not a practical model.

    No, that argument doesn't fly, because the output may be at a level (a
    broad phonemic transcription) that covers all of en-UK. (Some dialects
    make distinctions that RP doesn't, so you'd need to make those
    distinctions to get it really right.) Indeed, such a transcription
    could also suffice for GA - you can convert from RP to GA pretty
    well. But a GA transcription has less information than an RP
    transcription, so can't be transformed to be right for RP.
    Similarly, such a transcription should include all the /r/s, even
    those that non-rhotic speakers (e.g. RP) don't pronounce, because non-rhotic
    speakers can remove the /r/s, but rhotic speakers can't insert /r/s
    that aren't there in the transcription.
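
    The asymmetry can be shown with a deliberately crude sketch: deleting
    every /r/ that is not followed by a vowel is a well-defined function,
    but two different rhotic inputs can collapse to the same output, so
    there is no function going the other way. The vowel inventory and the
    function name are assumptions, and linking /r/ across word boundaries
    is ignored.

```python
# Rough, incomplete set of vowel symbols, enough for the examples below.
VOWELS = set("aeiouɑɒæɛɪʊʌəɔɜ")


def drop_nonprevocalic_r(ipa):
    """Derive a non-rhotic reading from a rhotic broad transcription."""
    out = []
    for i, ch in enumerate(ipa):
        if ch == "r":
            following = ipa[i + 1] if i + 1 < len(ipa) else ""
            if following not in VOWELS:
                continue  # /r/ before a consonant or a pause is dropped
        out.append(ch)
    return "".join(out)


spar = drop_nonprevocalic_r("spɑr")  # rhotic "spar"
spa = drop_nonprevocalic_r("spɑ")    # "spa" has no /r/ to begin with
# Both come out identical, so the /r/ cannot be recovered afterwards.
```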

    In fact, your system already does some of this: it transforms
    When will Merry Mary marry?
    to
    wɛn wɪl mɛri meri mæri?
    although most Americans don't make the three-way distinction.
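
    A speaker without the three-way distinction can still derive their
    own reading from that output by collapsing the vowels, which is why
    the richer transcription is the safer one to emit. A minimal sketch,
    assuming the merger can be approximated as "/æ/, /e/ and /ɛ/ all
    become /ɛ/ before /r/" (the function name is hypothetical):

```python
import re


def merge_before_r(ipa):
    """Collapse /æ/, /e/ and /ɛ/ to /ɛ/ when the next symbol is /r/."""
    return re.sub(r"[æeɛ](?=r)", "ɛ", ipa)


three_way = "wɛn wɪl mɛri meri mæri?"
merged = merge_before_r(three_way)
# The three names now share one transcription; the reverse mapping would
# have to guess which vowel each one originally had.
```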

    All this kind of stuff has of course been considered ad nauseam in the
    various proposals for "phonetic" orthographies for English. Or even by
    lexicographers - some dictionaries avoid giving separate UK and US
    pronunciations by using a system that can be mapped to either.

    Anyway, perhaps the real issue is that doing en-ipa as an example of
    Unicode transliteration is a weird idea! IPA is about transcription of
    spoken language, not transliteration of written language. Transforming
    from en to ipa by transcribing some random dialectal pronunciation of
    the written input is on a par with transforming from en to fr by translating
    it, which is surely beyond the scope of Unicode transforms!

    >Thirdly, you use the phrasing: "you **must** include the subtype". That
    >presumes some kind of consequence. Examples:

    ... or you give the user the wrong answer without telling them so.
