The Unicode Consortium Discussion Forum

Post subject: How do I find all "Latin" characters?
Posted: Sun Dec 26, 2010 8:35 pm
This is a question that comes up repeatedly.

There are several answers. Let me start, and others might chime in with additional considerations.

The most recent version of the Unicode Character Database (UCD) resides at http://unicode.org/Public/UNIDATA/.

You could grep for the word "LATIN" in the file UnicodeData.txt (in the name field, which is the field after the first semicolon; the field before it is the code point). That would give you all the characters with the word "LATIN" in their character name.
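
If you would rather script this than grep, here is a minimal Python sketch of the name search. It assumes the usual semicolon-separated UnicodeData.txt layout, with the code point in the first field and the character name in the second. The function name and the handful of inlined records are only for illustration; in practice you would pass in the lines of the real file.

```python
def names_containing(lines, word):
    """Return {code point: name} for UnicodeData.txt records whose name contains `word`."""
    matches = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)  # field 0: code point in hex
        name = fields[1]                 # field 1: character name
        if word in name:
            matches[code_point] = name
    return matches


# A few records in UnicodeData.txt format, inlined for illustration only.
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00AA;FEMININE ORDINAL INDICATOR;Lo;0;L;<super> 0061;;;;N;;;;;
0363;COMBINING LATIN SMALL LETTER A;Mn;230;NSM;;;;;N;;;;;
1F520;INPUT SYMBOL FOR LATIN CAPITAL LETTERS;So;0;ON;;;;;N;;;;;
""".splitlines()

latin_named = names_containing(SAMPLE, "LATIN")
for cp, name in sorted(latin_named.items()):
    print(f"U+{cp:04X} {name}")
```

On this sample the search picks up the letter, the combining mark and the symbol, but not U+00AA or the digit, because the match is purely on the name.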

You could grep for "Latin" in the file Scripts.txt (case-sensitive grep). That would give you all the characters with a formal assignment of "Latin" as their script property.
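
The same lookup can be sketched in Python. This assumes the Scripts.txt line layout of "code point or range ; script value # comment"; the function name and the inlined sample lines are illustrative, not a full copy of the file.

```python
def script_code_points(lines, script):
    """Collect the code points assigned to `script` from Scripts.txt-style lines."""
    result = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        cp_range, value = (part.strip() for part in data.split(";"))
        if value != script:
            continue
        first, _, last = cp_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # a single code point has no ".."
        result.update(range(start, end + 1))
    return result


# A few lines in Scripts.txt format, inlined for illustration only.
SAMPLE = """\
# Sample in the style of Scripts.txt
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0300..036F    ; Inherited # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
""".splitlines()

latin = script_code_points(SAMPLE, "Latin")
inherited = script_code_points(SAMPLE, "Inherited")
print(len(latin), 0xAA in latin, 0x0363 in latin, 0x0363 in inherited)
```

Comparing the script value exactly, rather than searching the whole line, avoids matching comment text such as "COMBINING LATIN SMALL LETTER X" on a line whose script is "Inherited".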

Both methods would exclude many characters that are used with the Latin script, such as the digits and the punctuation characters, which have a script value of "Common" because they are used with many scripts. The former (name-based) method is likely to include characters, such as symbols, that are derived from the shape of a Latin character. And combining marks, even U+0363 COMBINING LATIN SMALL LETTER A, don't have the script property "Latin", but instead "Inherited".

There are other differences. For example, U+00AA FEMININE ORDINAL INDICATOR is part of the Latin script (it looks like a raised "a"), but doesn't have the word "LATIN" in its character name.

Some characters are shared among a few scripts. These are documented in ScriptExtensions.txt, but so far, there are no entries for Latin characters in this file.

You might get the idea that you could look at the file Blocks.txt, and get a list of all the character blocks with the word "Latin" in their name. You could argue that characters used with Latin, but not named using the word "LATIN", would likely be found in these blocks. You would be partially correct.

You would get digits and the common punctuation marks, but also symbol characters like U+00A5 YEN SIGN, or U+00B5 MICRO SIGN. The latter is based on the Greek character mu.

You would miss, however, many characters, including the combining diacritical marks used with Latin characters, the modifier letters and many others that are clearly "Latin" characters.
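
To see both effects of the block-based approach, here is a rough Python sketch. It assumes the Blocks.txt layout of "start..end; block name"; the helper names and the inlined excerpt are illustrative only.

```python
def parse_blocks(lines):
    """Parse Blocks.txt-style lines into (start, end, name) tuples."""
    blocks = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        cp_range, name = (part.strip() for part in data.split(";"))
        first, last = cp_range.split("..")
        blocks.append((int(first, 16), int(last, 16), name))
    return blocks


def block_of(code_point, blocks):
    """Return the name of the block containing `code_point`, or None."""
    for start, end, name in blocks:
        if start <= code_point <= end:
            return name
    return None


# The first few blocks, inlined in Blocks.txt format for illustration only.
SAMPLE = """\
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
""".splitlines()

blocks = parse_blocks(SAMPLE)
in_latin_block = {}
for cp in (0x0035, 0x00A5, 0x00B5, 0x02B0, 0x0301):
    name = block_of(cp, blocks)
    in_latin_block[cp] = name is not None and "Latin" in name
    print(f"U+{cp:04X} {name}: {in_latin_block[cp]}")
```

The digit, U+00A5 YEN SIGN and U+00B5 MICRO SIGN all land in blocks with "Latin" in the name, while a modifier letter such as U+02B0 and a combining mark such as U+0301 do not.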

Finally, you can familiarize yourself with CLDR, the Common Locale Data Repository, which you can find at http://unicode.org/cldr. CLDR provides lists of characters that are used with certain languages.

So, in summary, the question "what are all the Latin characters?" is not a very well-defined question. For many other scripts, it is much easier to get a well-defined answer to this question. Why the difference? Latin has been used to write many languages, is used extensively in phonetic and other notations, and, over time, has acquired a pretty large repertoire.

It was also present in many early encodings, which further influenced the way it is encoded: "spread out" across many blocks in the standard.

