Towards a classification system for uses of the Private Use Area (derives from Re: Private Use Agreements and Unapproved Characters)

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Fri Apr 26 2002 - 08:40:04 EDT

I have now modified my idea for a system and put this system forward for
discussion. A copy of the original document is appended.

----

Consider please that there exists for the Private Use Area the concept of the hexadecimal point. The concept of a hexadecimal point is similar to that of a decimal point, the difference being that a decimal point is for base 10 numbers and a hexadecimal point is for base 16 numbers.

A classification system could regard all characters that are defined within it to have a code point value that is a real number, consisting of a part to the left of the hexadecimal point that is a value from the Private Use Area range of code points, and a part to the right of the hexadecimal point that is a value that is assigned as part of the method of registering the characters as being included in the classification system. So, for example, if a set of characters for some particular script P is registered, it might be registered as having a part to the right of the hexadecimal point of, say, 1005 so that if the characters were placed at U+E000 through to U+E0FF then they would be regarded within the classification system as being at A+E000.1005 through to A+E0FF.1005 at integer spaced intervals.
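As an illustration only, the A+ notation described above could be produced and read back along the following lines. This is a minimal sketch, not part of the proposal: the function names `is_private_use`, `classified_name` and `parse_classified` are hypothetical, and the Private Use Area ranges are those of the Unicode Standard (U+E000 to U+F8FF, plus planes 15 and 16).

```python
# Sketch of the "A+XXXX.TTTT" notation. All names here are illustrative
# assumptions; only the notation itself comes from the proposal.
HEX_DIGITS = "0123456789ABCDEF"

PUA_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
]


def is_private_use(cp):
    """Return True if the code point lies in a Private Use Area."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def classified_name(cp, tray):
    """Combine a PUA code point and a tray code into the A+ form."""
    tray = tray.upper()
    if not is_private_use(cp):
        raise ValueError("code point is not in the Private Use Area")
    if not tray or any(c not in HEX_DIGITS for c in tray):
        raise ValueError("tray code must be hexadecimal digits")
    return f"A+{cp:04X}.{tray}"


def parse_classified(name):
    """Split an A+ form back into (code point, tray code)."""
    if not name.startswith("A+") or "." not in name:
        raise ValueError("not in A+XXXX.TTTT form")
    left, tray = name[2:].split(".", 1)
    return int(left, 16), tray


print(classified_name(0xE000, "1005"))
print(parse_classified("A+E023.1005"))
```

The tray code is kept as a string rather than a number so that leading zeros and the number of hexadecimal places are preserved.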

Four hexadecimal places would seem to be a good balance between having scope and avoiding complexity, with .0 being unused for allocating to characters and having a meaning of "undefined". However, the possibility of more than four hexadecimal places is kept open so that later expansion is always possible.

Although phrased as a part to the right of the hexadecimal point, these agreed system codes are really designations of trays of character designs; however, the use of the hexadecimal point is convenient for expressing an individual character in the form, for example, of A+E023.1005 so that its meaning is uniquely defined within the system.

The classification system would be to use primarily U+E000 through to U+EFFF for defining blocks of up to 4096 characters and then using U+F35B to mean the start of defining the part to the right of the hexadecimal point and U+F35D to mean the end of defining the part to the right of the hexadecimal point. The characters U+F330 through to U+F339 are then used to provide digits and U+F341 through to U+F346 are used to provide letters for expressing the hexadecimal values.

In founts that implement scripts that are codified within the Private Use Area using this classification system, these characters are set as being zero width so that they do not display in a document. There would, however, be the possibility of having a general fount that is provided specifically as an analysis tool to be used when trying to detect such a type tray sequence within a plain text file where the characters are from the Private Use Area and the coding is as yet unknown by the analyst. In this case the characters would be displayed as follows.

U+F35B as a LEFT SQUARE BRACKET AUGMENTED WITH A SQUARE.
U+F35D as a RIGHT SQUARE BRACKET AUGMENTED WITH A SQUARE.
Digits along the pattern of U+F330 as DIGIT ZERO WITH A SQUARE BENEATH.
Letters along the pattern of U+F342 as LATIN CAPITAL LETTER B WITH A SQUARE BENEATH.

For the avoidance of doubt, these are open squares, not filled squares. For the brackets, the square would be a small square superimposed on top of the vertical line of the bracket symbol, centred on the centre of that vertical line.

A plain text file could indicate, for uses involving use of the classification system, the use of the particular script P mentioned above using the following sequence of characters.

U+F35B U+F331 U+F330 U+F330 U+F335 U+F35D
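The sequence above can be generated mechanically. The sketch below rests on an observation about the suggested values rather than on anything stated in the proposal: each marker code point equals the ASCII code of the corresponding character ("[", "]", "0" to "9", "A" to "F") plus hexadecimal F300. The names `MARK_OFFSET`, `encode_tray` and `as_u_notation` are hypothetical.

```python
# Sketch: build the marker sequence that announces a tray code.
# Assumption: marker = ASCII code + 0xF300, which matches every value
# given in the text (U+F35B, U+F35D, U+F330..U+F339, U+F341..U+F346).
MARK_OFFSET = 0xF300
START = MARK_OFFSET + ord("[")  # U+F35B
END = MARK_OFFSET + ord("]")    # U+F35D
HEX_DIGITS = "0123456789ABCDEF"


def encode_tray(tray):
    """Return the start marker, the marker digits and the end marker."""
    tray = tray.upper()
    if not tray or any(c not in HEX_DIGITS for c in tray):
        raise ValueError("tray code must be hexadecimal digits")
    digits = "".join(chr(MARK_OFFSET + ord(c)) for c in tray)
    return chr(START) + digits + chr(END)


def as_u_notation(text):
    """Render a string as space-separated U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


print(as_u_notation(encode_tray("1005")))
# prints: U+F35B U+F331 U+F330 U+F330 U+F335 U+F35D
```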

All characters in the Private Use Area would be presumed to have that part to the right of the hexadecimal point until another sequence starting U+F35B were received. This classification system would be primarily intended to be used for characters in the range U+E000 through to U+EFFF, yet is defined for all Private Use Area characters, including those in planes 15 and 16, for completeness. This means that the classification system can apply to those character designs that use U+F000 through to U+F0FF. Codes starting with a D to the right of the hexadecimal point could be allocated to them for permanent use. For example, some particular such fount Q might be designated to have, say, D157 to the right of the hexadecimal point.
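The rule that a tray code stays in force until the next sequence starting U+F35B could be applied by a reader of a plain text file roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: the function name `classify` is hypothetical, characters seen before any sequence are given the tray "0" (the "undefined" value), and a malformed or unterminated sequence is treated as an error, a case the proposal does not address.

```python
# Sketch: attach the tray code currently in force to each PUA character.
MARK_OFFSET = 0xF300
START, END = 0xF35B, 0xF35D
# Map marker code points (U+F330..U+F339, U+F341..U+F346) to hex digits.
MARKER_DIGITS = {MARK_OFFSET + ord(c): c for c in "0123456789ABCDEF"}

PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def is_private_use(cp):
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def classify(text):
    """Return (character, tray) pairs; tray is None outside the PUA."""
    current = "0"    # assumption: "undefined" until a sequence is seen
    pending = None   # digits collected while inside a marker sequence
    out = []
    for ch in text:
        cp = ord(ch)
        if pending is not None:
            if cp == END:
                if not pending:
                    raise ValueError("empty tray code")
                current, pending = "".join(pending), None
            elif cp in MARKER_DIGITS:
                pending.append(MARKER_DIGITS[cp])
            else:
                raise ValueError("unexpected character in tray sequence")
        elif cp == START:
            pending = []
        elif is_private_use(cp):
            # Stray marker digits or end markers outside a sequence land
            # here too and are treated as ordinary PUA characters.
            out.append((ch, current))
        else:
            out.append((ch, None))
    if pending is not None:
        raise ValueError("unterminated tray sequence")
    return out


sample = ("\uF35B\uF331\uF330\uF330\uF335\uF35D" "\uE023" "a"
          "\uF35B\uF344\uF331\uF335\uF337\uF35D" "\uF041")
for ch, tray in classify(sample):
    print(f"U+{ord(ch):04X}", tray)
```

In the sample, U+E023 is read under tray 1005 and U+F041 under tray D157, which is how characters from two different sets could share one file.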

The use of this hexadecimal point technique would allow characters from several different character sets to be used in the same plain text file if so desired.

Naturally, it would be best to avoid using characters in the range U+F300 through to U+F3FF for characters within type trays defined using this system. I have chosen the U+F3.. block for this suggested classification system as I am unaware of any existing use of that area. If anyone knows of any present use of the U+F3.. block I would be pleased to know of that use. However, as it may be difficult to find any part of the Private Use Area that is not being used for something by somebody somewhere, perhaps any such overlap will just have to exist as a known limitation of the classification system.

The classification system could also include codes for characters that are never intended to become standard Unicode characters yet for which a universal designation would be helpful.

The matter of keeping the system up to date could be partly resolved, for scripts that are under consideration for inclusion in Unicode, by a time out of validity of a year and a day after the publication of some particular version number of the Unicode specification. If necessary the time out could be extended if a decision about whether to include the characters in Unicode has not been reached by the time of that version of Unicode being published, or, if necessary, the validity could be made permanent if the decision is not to include the characters in Unicode.
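The arithmetic of "a year and a day" can be sketched as below. The function name `validity_expiry` is hypothetical, the date passed in is an arbitrary placeholder rather than a real Unicode publication date, and the handling of a publication on 29 February (counting the anniversary as 1 March) is an assumption that the proposal does not settle.

```python
from datetime import date, timedelta


def validity_expiry(published):
    """Return the date a year and a day after the publication date."""
    try:
        anniversary = published.replace(year=published.year + 1)
    except ValueError:
        # 29 February has no anniversary in a non-leap year;
        # assumption: use 1 March of the following year instead.
        anniversary = date(published.year + 1, 3, 1)
    return anniversary + timedelta(days=1)


# Placeholder date for illustration only.
print(validity_expiry(date(2002, 4, 26)))  # prints: 2003-04-27
```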

I feel that such a classification system would be very helpful and potentially of great usefulness.

I feel that such a classification system for the Private Use Area should be established, and I would be interested in writing it up, publishing it and participating in allocations of type tray codes to those people who would like to have a code allocated.

I recognize that this classification system will not be codified as part of the Unicode system as such, unless the meanings of the codes in the U+F3.. block that I have suggested become promoted to regular Unicode codes at some future date. As to whether they could be promoted, that is a debatable point, yet I suggest that regular Unicode code points that designate an optional classification system that may but need not be used in conjunction with the Private Use Area would not be endorsing any "assignment to a particular set of characters" but simply allowing a more rigorous classification system to be used than can be used where the codes used to produce the classification system are within the area that is being classified.

However, whilst recognizing that having the codes used to produce the classification system of the Private Use Area within the Private Use Area itself could lead to problems, I feel that, with care, the suggested classification system could be used to provide a workable system that could be of good use in practice.

William Overington

26 April 2002

-----Original Message-----
From: William Overington <WOverington@ngo.globalnet.co.uk>
To: unicode@unicode.org <unicode@unicode.org>
Cc: Patrick Rourke <ptrourke@methymna.com>; Doug Ewell <dewell@adelphia.net>; archive@ngo.globalnet.co.uk <archive@ngo.globalnet.co.uk>
Date: Wednesday, March 13, 2002 12:50 PM
Subject: Re: Private Use Agreements and Unapproved Characters

>Here is a system that I think would work.
>
>Consider please that there exists for the private use area the concept of
>the hexadecimal point. The term "hexadecimal point" is similar to the
>concept of a decimal point, the difference being that a decimal point is for
>base 10 numbers and a hexadecimal point is for base 16 numbers.
>
>An agreed system could regard all characters that are defined within it to
>have a code point value that is a real number, consisting of a part to the
>left of the hexadecimal point that is a value from the private use area
>range of code points, and a part to the right of the hexadecimal point that
>is a value that is assigned as part of the method of registering the
>characters as being included in the agreed system. So, for example, if a
>set of characters for some particular script P is registered, it might be
>registered as having a part to the right of the hexadecimal point of, say,
>1005 so that if the characters were placed at U+E000 through to U+E0FF then
>they would be regarded within the agreed system as being at A+E000.1005
>through to A+E0FF.1005 at integer spaced intervals.
>
>Four hexadecimal places would seem to be a good balance between having scope
>and avoiding complexity, with .0 being unused for allocating to characters
>and having a meaning of "undefined".
>
>One possibility for the agreed system would be to use U+E000 through to
>U+EFFF for defining blocks of up to 4096 characters and then using two
>characters from the range U+F000 through to U+F8FF to mean the start and the
>end of defining the part to the right of the hexadecimal point. I am aware
>at the back of my mind that some of the characters in the range U+F000
>through to U+F8FF are often used for a particular type of user defined fount
>such as dingbat type things, so I wonder if someone could please say if they
>know of that matter so that any suggestions for defining these start and end
>of defining codes does not clash with that usage. Indeed that usage could
>be included into the agreed system and codes starting with a D to the right
>of the hexadecimal point could be allocated to them for permanent use. For
>example, some particular such fount Q might be designated to have, say,
>D157 to the right of the hexadecimal point.
>
>Suppose though, on a temporary basis herein pending resolution of that
>matter, that within the agreed system U+F000 were to mean the start of
>defining the part to the right of the hexadecimal point and U+F001 were to
>mean the end of defining the part to the right of the hexadecimal point,
>then a plain text file could indicate, for uses involving use of the agreed
>system, the use of the particular script P mentioned above using the
>following sequence of characters.
>
>U+F000 U+0031 U+0030 U+0030 U+0035 U+F001
>
>All characters in the private use area would be presumed to have that part
>to the right of the hexadecimal point until another sequence starting U+F000
>were received.
>
>The use of this hexadecimal point technique would allow characters from
>several different character sets to be used in the same plain text file.
>
>The agreed system could also include codes for characters that are never
>intended to become standard Unicode characters yet for which a universal
>designation would be helpful. These character sets could use designations
>starting with some character such as C to the right of the hexadecimal
>point.
>
>Although phrased as a part to the right of the hexadecimal point, these
>agreed system codes are really designations of trays of character designs;
>however, the use of the hexadecimal point is convenient for expressing an
>individual character in the form, for example, of A+E023.1005 so that its
>meaning is uniquely defined within the system.
>
>Also, by using a part to the right of the hexadecimal point, the system has
>unlimited scope.
>
>The matter of keeping the system up to date could be partly resolved by, for
>scripts that are under consideration for inclusion in Unicode, a time out of
>validity of a year and a day after the publication of some particular
>version number of the Unicode specification. If necessary the time out
>could be extended if a decision about whether to include the characters in
>Unicode has not been reached by the time of that version of Unicode being
>published, or, if necessary the validity could be made permanent if the
>decision is not to include the characters in Unicode.
>
>I feel that such an agreed system would be very helpful and potentially of
>great usefulness.
>
>The next matter is as to what is meant by agreed in the phrase agreed
>system.
>
>I feel that if the matter is discussed here in this discussion forum then
>whatever consensus exists when the discussion hopefully reaches a consensus
>could be taken as the agreed system. Please know that although the phrase
>"private agreement" is used in the specification in the section about the
>private use area, later in that section the word "published" is used, so one
>does not, in fact, need any agreement at all, it is quite permissible to
>simply publish one's own suggested system. Naturally, the more agreement
>amongst those people who express an interest that one can achieve the better
>that is, yet I feel that the best way forward is to discuss a system and
>then proceed by taking on board such comments that are received that can be
>accommodated in the system and then publishing a system and starting to use
>that system and then anyone who so wishes may participate in the use of that
>published system.
>
>William Overington
>
>13 March 2002
