Re: Obsolete characters

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Thu Jan 22 2009 - 15:40:25 CST

Next message: António MARTINS-Tuválkin: "Re: Obsolete characters"

Previous message: Michael Everson: "Re: Obsolete characters"
In reply to: Michael Everson: "Re: Obsolete characters"

Mark

On Thu, Jan 22, 2009 at 12:28, Michael Everson <everson@evertype.com> wrote:

> On 22 Jan 2009, at 19:37, Mark Davis wrote:
>
>> Every time anyone has any kind of organization of anything, it could be
>> considered "prejudicial" to someone somewhere.
>>
>
> True, and I understand that your organization may not have the same
> requirements as I might have. Nevertheless, the idea that yogh should be
> "archaic" (or whatever) and not on a character picker, when I encountered it
> in school in Arizona (a pretty ordinary place, in the late 1970s) seems to
> me to be wrong. Chaucer is still on the c
>
>> And yet for a character picker there are so very many Unicode characters
>> that you have to have some kind of organization or the characters you want
>> are difficult to find within a morass of others.
>>
>
> So how will people find the "obscure" ones?

First off, and I want to emphasize this again, we are trying different
models of dealing with the data, so we may or may not use the modern vs
non-modern feature of individual characters in modern scripts. I've passed
out a link to the current mockup, so you can see some of the things we are
trying, but you should not take it as final at all.

What we are doing there is having the 'uncommon' characters available, just
separated from the others. As I said, this may or may not work.

>
>
>> We considered UCA ordering, the ordering in the charts on
>> http://www.unicode.org/charts/collation/
>>
>> There are a few problems with that, as you will see if you take a look.
>> • Especially in the case of symbols or punctuation, it is hard to
>> find things
>>
>
> Why start here?

"start"?

> I was talking about letters of the alphabet(s). You've been dividing Latin
> up into ins and outs.

It isn't ins-and-outs; both are available.

> Symbols should (if possible) be found by category, but that was outside of
> the scope of the UCD.

If you mean the General Category, that is too gross a categorization for
symbols.

I also assume that by UCD (Unicode Character Database) you actually mean UCA
(Unicode Collation Algorithm). If you do actually mean UCD, then you'd have
to say which properties you are talking about, since there are over 70.
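To illustrate why the General Category is too coarse for symbols, here is a
minimal Python sketch using the standard unicodedata module. The sample
characters are chosen only for illustration and are not taken from the
picker's data; the point is that quite unrelated symbols all land in the same
category (So, "Symbol, other").

```python
import unicodedata

# Sample symbols chosen only for illustration (not from the picker's data).
SAMPLES = ["\u2603", "\u2654", "\u266b", "\u00a9"]


def general_categories(chars):
    """Map each character to its two-letter General Category value."""
    return {ch: unicodedata.category(ch) for ch in chars}


categories = general_categories(SAMPLES)
for ch, cat in categories.items():
    # Print code point, character name, and General Category.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

A snowman, a chess piece, a musical note and the copyright sign are not
distinguished from each other by this property, so a picker would need some
finer grouping for symbols.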

> I do think punctuation should be together. If not, what do you get?

"be together" is underspecified: it could mean "together with other
punctuation, regardless of script" or "together with the scripts it is used
with". What we are trying is to put script-specific punctuation / symbols
with the letters of their script, and the others together in general
script / punctuation categories. Because this is not a partition, a character
used in multiple scripts can be included more than once.
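A minimal sketch of what "not a partition" means for the data, in Python. The
category names and their contents are hypothetical, made up for illustration;
they are not the picker's real tables. The danda (U+0964) is used with several
Indic scripts, so it is listed under more than one of them.

```python
# Hypothetical category names and contents, for illustration only.
DANDA = "\u0964"  # DEVANAGARI DANDA, punctuation shared by several Indic scripts

picker_categories = {
    "Devanagari": ["\u0915", "\u0916", DANDA],
    "Bengali": ["\u0995", "\u0996", DANDA],
    "General punctuation": [".", ",", "!"],
}


def categories_containing(char, categories):
    """Return the sorted names of all categories that list the character."""
    return sorted(name for name, members in categories.items() if char in members)


def is_partition(categories):
    """True only if no character is listed in more than one category."""
    all_members = [ch for members in categories.values() for ch in members]
    return len(all_members) == len(set(all_members))


print(categories_containing(DANDA, picker_categories))
print(is_partition(picker_categories))
```

The cost of allowing duplicates is a slightly larger table; the benefit is
that a user looking under either script finds the character.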

>
>> • The interleaving of compatibility characters also makes it
>> difficult (take Arabic, for example).
>>
>
> You could leave most of those out of course. This is a little bit reductio
> ad absurdum.

No, it isn't. You find it prejudicial to even have a yogh in a different
category, yet are willing to drop 750 Arabic characters.

Now, I agree that these are not of the same class as other Arabic characters
(or even yogh), but we are trying to have a framework where someone could
pick any Unicode character.

> Your list mostly consisted of Latin letters, including phonetic ones. Those
> aren't the same category as the Arabic presentation forms, for goodness'
> sake.

The reason my list consisted of those is that those are ones where the
status is not easily derivable from Unicode data -- thus my circulation of
this list.

>
>
>> • The ordering of scripts is arbitrary.
>>
>
> So? I imagine Latin ought to come first, but even if it came second, my
> request is that the Latin characters all be presented in UCD order (at a
> minimum).

Because depending on how one counts "Latin", that would be over 2K
characters. The following are in UCA order, with contiguous ranges elided
with '-':

[⒜-⒵ ℃ ℉ ↀ-ↂ ↆ-ↈ ↅ a ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ⓐAＡ𝐀𝐴𝑨𝒜
𝓐𝔄𝔸𝕬𝖠𝗔𝘈𝘼𝙰ⒶªᵃₐᴬáÁàÀăĂắ ẮằẰẵẴẳẲâÂấẤầẦẫẪẩẨǎǍåÅ ÅǻǺäÄǟǞãÃȧȦǡǠąĄāĀảẢȁȀ
ȃȂạẠặẶậẬḁḀ ㏂ ℀ ℁ ㏟ ꜳꜲ æÆᴭǽǼǣǢ ꜵꜴ ꜷꜶ㍳ ꜹꜸꜻꜺ ꜽ Ꜽ ẚ ᴀ ⱥȺ ᶏ ᴁ ᴂᵆ ɐⱯᵄ ɑ Ɑᵅ ᶐ ɒᶛ
bｂ𝐛𝑏𝒃𝒷𝓫𝔟𝕓𝖇𝖻𝗯𝘣 𝙗𝚋ⓑBＢℬ𝐁𝐵𝑩𝓑𝔅𝔹𝕭𝖡𝗕𝘉𝘽𝙱Ⓑᵇᴮ ḃḂḅḄḇḆ ㍴㏃ ʙ ƀɃ
ᴯ ᴃ ᵬ ᶀ ɓƁ ƃƂ cｃⅽ𝐜𝑐𝒄𝒸𝓬𝔠𝕔𝖈𝖼𝗰 𝘤𝙘𝚌ⓒCＣⅭℂℭ𝐂𝐶𝑪𝒞𝓒𝕮𝖢𝗖𝘊𝘾𝙲Ⓒ
ᶜćĆĉĈčČċĊçÇḉḈ ℅ ℆ ㏆㎈㏄㏅㎝㎠㎤㏇ ᴄ ȼȻ ƈƇ ɕ ᶝ ↄↃ ꜿꜾ dｄⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽
𝗱𝘥𝙙𝚍ⓓDＤⅮⅅ𝐃𝐷𝑫𝒟𝓓𝔇𝔻𝕯𝖣𝗗𝘋𝘿 𝙳ⒹᵈᴰďĎḋḊḑḐḍḌḓḒḏḎđĐðÐᶞ ꝺꝹ ㍲ ȸ㏈㎗㍷-㍹
ǳʣǲǱǆǅǄ ʥ ʤ ᴅ ᴆ ᵭ ᶁ ɖƉ ɗƊ ᶑ ƌ Ƌ ȡ ꝱ ẟ eｅℯⅇ𝐞𝑒𝒆𝓮𝔢𝕖𝖊𝖾𝗲
𝘦𝙚𝚎ⓔEＥℰ𝐄𝐸𝑬𝓔𝔈𝔼𝕰𝖤𝗘𝘌𝙀𝙴Ⓔᵉ ₑᴱéÉèÈĕĔêÊếẾềỀễỄểỂěĚë
ËẽẼėĖȩȨḝḜęĘēĒḗḖḕḔẻẺȅȄ ȇȆẹẸệỆḙḘḛḚ ㋍㋎ ᴇ ɇɆ ᶒ ⱸ ǝƎᴲ ⱻ əƏᵊₔ ᶕ ɛƐℇᵋ ᶓ ɘ ɚ ɜᶟ ᶔ
ᴈᵌ ɝ ɞ ʚ ɤ f ｆ𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ⓕFＦℱ𝐅𝐹𝑭 𝓕𝔉𝔽𝕱𝖥𝗙𝘍𝙁𝙵ⒻᶠḟḞꝼꝻ
℻ ﬀ ﬃ ﬄ ﬁ ﬂ ㎙ ʩ ꜰ ᵮ ᶂ ƒƑ ⅎℲ ꟻ gｇℊ𝐠𝑔𝒈𝓰𝔤𝕘𝖌𝗀𝗴𝘨𝙜𝚐ⓖGＧ𝐆
𝐺𝑮𝒢𝓖𝔊𝔾𝕲𝖦𝗚𝘎𝙂𝙶ⒼᵍᴳǵǴğĞĝĜ ǧǦġĠģĢḡḠᵹꝽ ㏿㎇㎓㎬㏉ ɡᶢ ɢ ǥǤ ᶃ ɠƓ ʛ ᵷ ꝿꝾ ɣ
Ɣˠ ƣƢ hｈℎ𝐡𝒉𝒽𝓱𝔥𝕙𝖍𝗁𝗵𝘩𝙝𝚑 ⓗHＨℋ-ℍ𝐇𝐻𝑯𝓗𝕳𝖧𝗛𝘏𝙃𝙷ⒽʰᴴĥĤ
ȟȞḧḦḣḢḩḨḥḤḫḪẖħℏĦ ㏊㋌㏋㍱㎐ ʜ ƕǶ ɦʱ ⱨⱧ ⱶⱵ ꜧ Ꜧ ɧ iｉⅰℹⅈ𝐢𝑖𝒊𝒾𝓲𝔦𝕚𝖎𝗂𝗶𝘪𝙞
𝚒ⓘIＩⅠℐ ℑ𝐈𝐼𝑰𝓘𝕀𝕴𝖨𝗜𝘐𝙄𝙸Ⓘⁱ ᵢᴵíÍìÌĭĬîÎǐǏïÏḯḮĩĨİįĮ īĪỉỈȉȈȋȊịỊḭḬ ⅱⅡ ⅲⅢ
ĳĲ ㏌㍺ ⅳⅣ ⅸⅨ ı𝚤 ɪᶦ ꟾ ᴉᵎ ɨƗᶤ ᵻᶧ ᶖ ɩƖᶥ ᵼ jｊⅉ𝐣𝑗𝒋
𝒿𝓳𝔧𝕛𝖏𝗃𝗷𝘫𝙟𝚓ⓙJＪ𝐉𝐽𝑱𝒥𝓙𝔍𝕁𝕵 𝖩𝗝𝘑𝙅𝙹ⒿʲⱼᴶĵĴǰ ȷ𝚥 ᴊ ɉɈ ʝᶨ ɟᶡ ʄ
kｋ𝐤𝑘𝒌𝓀𝓴𝔨𝕜𝖐𝗄𝗸𝘬 𝙠𝚔ⓚKKＫ𝐊𝐾𝑲𝒦𝓚𝔎𝕂𝕶𝖪𝗞𝘒𝙆𝙺Ⓚᵏ ᴷḱḰǩǨķĶḳḲḵḴ ㎄
㎅㎉㎏㎑㏍㎘㎞㏎㎢㎦㎪㏏㎸㎾ ᴋ ᶄ ƙƘ ⱪⱩ ꝁꝀ ꝃꝂ ꝅꝄ ʞ lｌ
ⅼℓ𝐥𝑙𝒍𝓁𝓵𝔩𝕝𝖑𝗅𝗹𝘭𝙡𝚕ⓛLＬⅬℒ𝐋 𝐿𝑳𝓛𝔏𝕃𝕷𝖫𝗟𝘓𝙇𝙻ⓁˡᴸĺĹľĽļĻḷ
ḶḹḸḽḼḻḺłŁŀĿ ǉǈǇ Ỻ ỻ ㏐-㏒ ʪ ㋏㏓ ʫ ʟᶫ ꝇꝆ ᴌ ꝉꝈ ƚȽ ⱡⱠ ɫⱢ ɬ ᶅᶪ ɭᶩ ȴ ꝲ ɮ ꞁ Ꞁ ƛ ʎ
mｍⅿ𝐦𝑚𝒎𝓂𝓶𝔪𝕞𝖒𝗆𝗺𝘮𝙢 𝚖ⓜMＭⅯℳ𝐌𝑀𝑴𝓜𝔐𝕄𝕸𝖬𝗠𝘔𝙈𝙼Ⓜᵐᴹ ḿḾṁṀṃṂ ㎧㎨㎡
㎥㎃㏔㎆㎎㎒㏕㎖㎜㎟㎣㏖㎫㎳㎷㎹㎽㎿ ᴍ ᵯ ᶆ ɱⱮᶬ ꟽ ꟿ ꝳ nｎ
𝐧𝑛𝒏𝓃𝓷𝔫𝕟𝖓𝗇𝗻𝘯𝙣𝚗ⓝNＮℕ𝐍𝑁𝑵𝒩 𝓝𝔑𝕹𝖭𝗡𝘕𝙉𝙽ⓃⁿᴺńŃǹǸňŇñÑṅṄ
ņŅṇṆṋṊṉṈ ㎁㎋ ǌǋǊ ㎚ № ㎱㎵㎻ ɴᶰ ᴻ ᴎ ᵰ ɲƝᶮ ƞȠ ᶇ ɳᶯ ȵ ꝴ ŋŊᵑ oｏℴ𝐨𝑜𝒐𝓸𝔬
𝕠𝖔𝗈𝗼𝘰𝙤𝚘ⓞOＯ𝐎𝑂𝑶𝒪𝓞𝔒𝕆𝕺𝖮𝗢𝘖 𝙊𝙾ⓄºᵒₒᴼóÓòÒŏŎôÔốỐồỒỗỖ
ổỔǒǑöÖȫȪőŐõÕṍṌṏṎȭȬȯȮȱ ȰøØǿǾǫǪǭǬōŌṓṒṑṐỏỎȍȌȏȎ ơƠớỚờỜỡỠởỞợỢọỌộỘ œŒ ꝏ Ꝏ ㍵ ᴏ ᴑ ɶ
ᴔ ᴓ ɔƆᵓ ᴐ ᴒ ᶗ ꝍꝌ ᴖᵔ ᴗᵕ ⱺ ɵƟᶱ ꝋꝊ ɷ ȣȢᴽ ᴕ pｐ𝐩𝑝𝒑𝓅𝓹𝔭𝕡𝖕𝗉𝗽𝘱𝙥𝚙
ⓟPＰℙ𝐏𝑃𝑷𝒫𝓟𝔓𝕻𝖯𝗣𝘗𝙋𝙿ⓅᵖᴾṕṔ ṗṖ ㏘㎀㎩㍶㎊㏗㏙㏚㎰㉐㎴㎺ ᴘ ᵽⱣ ꝑꝐ ᵱ ᶈ ƥƤ
ꝓꝒ ꝕꝔ ꟼ ɸᶲ ⱷ qｑ𝐪𝑞𝒒𝓆𝓺𝔮𝕢𝖖𝗊 𝗾𝘲𝙦𝚚ⓠQＱℚ𝐐𝑄𝑸𝒬𝓠𝔔𝕼𝖰𝗤𝘘𝙌𝚀Ⓠ ȹ ꝗꝖ
ꝙꝘ ʠ ɋɊ ĸ rｒ𝐫𝑟𝒓𝓇 𝓻𝔯𝕣𝖗𝗋𝗿𝘳𝙧𝚛ⓡRＲℛ-ℝ𝐑𝑅𝑹𝓡𝕽𝖱
𝗥𝘙𝙍𝚁ⓇʳᵣᴿŕŔřŘṙṘŗŖȑȐȓȒṛ ṚṝṜṟṞꞃꞂ ㎭-㎯ ₨ ʀƦ ꝛꝚ ᴙ ɍɌ ᵲ ɹʴ ᴚ ɺ ᶉ ɻʵ ⱹ ɼ ɽⱤ ɾ ᵳ
ɿ ʁʶ ꝵ ꝶ ꝝꝜ sｓ 𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ⓢSＳ𝐒𝑆𝑺𝒮𝓢
𝔖𝕊𝕾𝖲𝗦𝘚𝙎𝚂ⓈˢśŚṥṤŝŜšŠṧṦṡ ṠşŞṣṢṩṨșȘſꞅꞄẛ ℠ ㏛ ßẞ ﬆﬅ ㏜ ꜱ ᵴ ᶊ ʂᶳ ȿ ẜ ẝ ʃ Ʃᶴ
ᶋ ƪ ʅ ᶘ ʆ tｔ𝐭𝑡𝒕𝓉𝓽𝔱 𝕥𝖙𝗍𝘁𝘵𝙩𝚝ⓣTＴ𝐓𝑇𝑻𝒯𝓣𝔗𝕋𝕿𝖳𝗧𝘛
𝙏𝚃ⓉᵗᵀťŤẗṫṪţŢṭṬțȚṱṰṯṮꞇ Ꞇ ʨ ℡ ᵺ ㎔ ™ ƾʦ ʧ ꜩꜨ ᴛ ŧŦ ⱦȾ ᵵ ƫᶵ ƭƬ ʈƮ ȶ ꝷ ʇ
uｕ𝐮𝑢𝒖𝓊𝓾𝔲𝕦𝖚𝗎𝘂𝘶𝙪𝚞ⓤUＵ𝐔 𝑈𝑼𝒰𝓤𝔘𝕌𝖀𝖴𝗨𝘜𝙐𝚄ⓊᵘᵤᵁúÚùÙŭ
ŬûÛǔǓůŮüÜǘǗǜǛǚǙǖǕűŰũŨ ṹṸųŲūŪṻṺủỦȕȔȗȖưƯứỨừỪữ ỮửỬựỰụỤṳṲṷṶṵṴ ᴜᶸ ᴝᵙ ᴞ ᵫ ʉɄᶶ ᵾ ᶙ
ɥᶣ ʮ ʯ ɯƜᵚ ᴟ ɰᶭ ʊƱᶷ ᵿ vｖⅴ𝐯𝑣𝒗𝓋𝓿𝔳𝕧
𝖛𝗏𝘃𝘷𝙫𝚟ⓥVＶⅤ𝐕𝑉𝑽𝒱𝓥𝔙𝕍𝖁𝖵𝗩𝘝 𝙑𝚅ⓋᵛᵥⱽṽṼṿṾ ㏞ ⅵⅥ ⅶⅦ ⅷⅧ ꝡꝠ ᴠ ꝟꝞ ᶌ ʋƲᶹ
ⱱ ⱴ ỽỼ ʌɅᶺ wｗ𝐰𝑤𝒘𝓌𝔀𝔴𝕨𝖜𝗐𝘄𝘸𝙬𝚠ⓦW Ｗ𝐖𝑊𝑾𝒲𝓦𝔚𝕎𝖂𝖶𝗪𝘞𝙒𝚆ⓌʷᵂẃẂẁẀ
ŵŴẘẅẄẇẆẉẈ ㏝ ᴡ ⱳⱲ ʍ xｘ ⅹ𝐱𝑥𝒙𝓍𝔁𝔵𝕩𝖝𝗑𝘅𝘹𝙭𝚡ⓧXＸⅩ𝐗𝑋𝑿
𝒳𝓧𝔛𝕏𝖃𝖷𝗫𝘟𝙓𝚇ⓍˣₓẍẌẋẊ ⅺⅪ ⅻⅫ ᶍ yｙ𝐲𝑦𝒚𝓎𝔂𝔶𝕪𝖞𝗒𝘆𝘺𝙮𝚢ⓨY
Ｙ𝐘𝑌𝒀𝒴𝓨𝔜𝕐𝖄𝖸𝗬𝘠𝙔𝚈ⓎʸýÝỳỲŷ ŶẙÿŸỹỸẏẎȳȲỷỶỵỴ ʏ ɏɎ ƴ Ƴ ỿỾ
zｚ𝐳𝑧𝒛𝓏𝔃𝔷𝕫𝖟𝗓𝘇𝘻𝙯𝚣ⓩZ Ｚℤℨ𝐙𝑍𝒁𝒵𝓩𝖅𝖹𝗭𝘡𝙕𝚉ⓏᶻźŹẑẐž ŽżŻẓẒẕẔ ƍ ᴢ
ƶƵ ᵶ ᶎ ȥȤ ʐᶼ ʑᶽ ɀ ⱬⱫ ꝣꝢ ʒƷᶾǯǮ ᴣ ƹƸ ᶚ ƺ ʓ ȝȜ þÞ ꝥꝤ ꝧꝦ ƿǷ ꝩꝨ ꝫꝪ ꝭꝬ ꝯꝮꝰ ꝸ ƻ ꜫ Ꜫ
ꜭꜬ ꜯꜮ ƨƧ ƽƼ ƅƄ ʔ ɂɁ ŉ ꜣꜢ ꞌꞋ ʕˤ ᴤ ᴥᵜ ꜥꜤ ʡ ʢ ʖ ǀ-ǃ ʗ ʘ ʬ ʭ]
------------------------------
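The '-' elision used in the list above can be sketched roughly as follows.
This is an assumption about the convention, inferred from the list itself
(runs of three or more consecutive code points are collapsed, pairs are left
alone); it is not the code that produced the list, and the function name and
the min_run parameter are made up for illustration.

```python
def elide_ranges(chars, min_run=3):
    """Collapse runs of consecutive code points into 'first-last'.

    `chars` is a string (or list of single characters) already in the
    desired display order. Runs shorter than `min_run` are left as-is.
    """
    out = []
    i = 0
    n = len(chars)
    while i < n:
        j = i
        # Extend the run while each character is exactly one code point
        # after the previous one.
        while j + 1 < n and ord(chars[j + 1]) == ord(chars[j]) + 1:
            j += 1
        if j - i + 1 >= min_run:
            out.append(f"{chars[i]}-{chars[j]}")
        else:
            out.extend(chars[i:j + 1])
        i = j + 1
    return " ".join(out)


# The four click letters U+01C0..U+01C3 followed by U+0297.
print(elide_ranges("\u01c0\u01c1\u01c2\u01c3\u0297"))
```

Note that elision only helps where the display order happens to follow code
point order, which in UCA order is the exception rather than the rule.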

>
>
>> • Using UCA order significantly increases the bandwidth
>> requirements. (And carbon footprint ;-)
>> http://googleblog.blogspot.com/2009/01/powering-google-search.html)
>>
>
> Awww come on. Really? You don't have to GENERATE the UCA order on the fly.
> Right now you're making a table, aren't you? So you generate the order once
> for the UCA and then that table gets loaded.

Software is not magic, nor are tables. We do, of course, generate the
tables statically, because we can't depend on JavaScript sorting. For the
buttons within each set, we can choose to have them be in UCA order, or not.
We currently use UCA order for most modern scripts, except for Han.

The tables currently compress 350,076 bytes worth of characters (UTF-8) to
21,403 bytes. If we sort all the buttons by UCA, then it grows to 37,815
bytes. If we sort only by codepoint, it is 14,910. So we are already taking
a certain hit for doing as much sorting as we are.
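To put those figures in proportion, here is a small Python sketch. The byte
counts are the ones quoted above; everything else is illustrative. The
count_runs function is only a stand-in to show why ordering matters, on the
assumption that the encoding benefits from runs of consecutive code points;
the actual compression scheme is not described in this message.

```python
# Sizes in bytes, as quoted in the message above.
RAW_UTF8 = 350_076
COMPRESSED = {
    "current (partly UCA-sorted)": 21_403,
    "all buttons in UCA order": 37_815,
    "code point order only": 14_910,
}


def percent_of_raw(size, raw=RAW_UTF8):
    """Compressed size as a percentage of the raw UTF-8 size."""
    return round(100 * size / raw, 1)


def count_runs(chars):
    """Count maximal runs of consecutive code points in the given order."""
    runs = 0
    prev = None
    for ch in chars:
        if prev is None or ord(ch) != prev + 1:
            runs += 1
        prev = ord(ch)
    return runs


for label, size in COMPRESSED.items():
    print(f"{label}: {size} bytes, {percent_of_raw(size)}% of raw")

# Toy example: a collation-like order interleaves cases, which breaks up
# the runs that plain code point order would give.
interleaved = "aAbBcC"
by_code_point = "".join(sorted(interleaved))
print(count_runs(interleaved), count_runs(by_code_point))
```

In the toy example the interleaved order has six runs and the code point
order has two, which is the same effect, in miniature, as the difference
between 37,815 and 14,910 bytes.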

>
> I think it's neat what you're doing, but I don't see that any of the
> reasons you've given address my concern at being able to find all the Latin
> characters easily.

It is clearly an open issue as to whether a categorization by modern or not
is useful for sets like Latin - we're trying different tactics to see what
works and what doesn't. Having accurate information on whether characters
are "customary modern use" or not can help us to see whether that approach
works or not. And you'd be one of the people who could certainly help in
that regard, so feedback you have is appreciated.

>
> Michael Everson * http://www.evertype.com
>
>
>
>


This archive was generated by hypermail 2.1.5 : Thu Jan 22 2009 - 15:43:15 CST