From: Stephane Bortzmeyer (bortzmeyer@nic.fr)
Date: Thu Dec 26 2002 - 04:15:08 EST
Those who want to actually use it may want to look at the libstringprep
library <URL:http://www.josefsson.org/libstringprep/>.
Network Working Group                                         P. Hoffman
Request for Comments: 3454                                    IMC & VPNC
Category: Standards Track                                    M. Blanchet
                                                                Viagenie
                                                           December 2002
Preparation of Internationalized Strings ("stringprep")
Status of this Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2002). All Rights Reserved.
Abstract
This document describes a framework for preparing Unicode text
strings in order to increase the likelihood that string input and
string comparison work in ways that make sense for typical users
throughout the world. The stringprep protocol is useful for protocol
identifier values, company and personal names, internationalized
domain names, and other text strings.
This document does not specify how protocols should prepare text
strings. Protocols must create profiles of stringprep in order to
fully specify the processing options.
Table of Contents
1. Introduction
1.1 Terminology
1.2 Using stringprep in protocols
2. Preparation Overview
3. Mapping
3.1 Commonly mapped to nothing
3.2 Case folding
4. Normalization
5. Prohibited Output
5.1 Space characters
5.2 Control characters
5.3 Private use
5.4 Non-character code points
5.5 Surrogate codes
5.6 Inappropriate for plain text
5.7 Inappropriate for canonical representation
5.8 Change display properties or deprecated
5.9 Tagging characters
6. Bidirectional Characters
7. Unassigned Code Points in Stringprep Profiles
7.1 Categories of code points
7.2 Reasons for difference between stored strings and queries
7.3 Versions of applications and stored strings
8. References
8.1 Normative references
8.2 Informative references
9. Security Considerations
9.1 Stringprep-specific security considerations
9.2 Generic Unicode security considerations
10. IANA Considerations
11. Acknowledgements
A. Unicode repertoires
A.1 Unassigned code points in Unicode 3.2
B. Mapping Tables
B.1 Commonly mapped to nothing
B.2 Mapping for case-folding used with NFKC
B.3 Mapping for case-folding used with no normalization
C. Prohibition tables
C.1 Space characters
C.1.1 ASCII space characters
C.1.2 Non-ASCII space characters
C.2 Control characters
C.2.1 ASCII control characters
C.2.2 Non-ASCII control characters
C.3 Private use
C.4 Non-character code points
C.5 Surrogate codes
C.6 Inappropriate for plain text
C.7 Inappropriate for canonical representation
C.8 Change display properties or are deprecated
C.9 Tagging characters
D. Bidirectional tables
D.1 Characters with bidirectional property "R" or "AL"
D.2 Characters with bidirectional property "L"
Authors' Addresses
Full Copyright Statement
1. Introduction
Application programs can display text in many different ways.
Similarly, a user can enter text into an application program in a
myriad of fashions. Internationalized text (that is, text that is
not restricted to the narrow set of US-ASCII characters) has many
input and display behaviors that make it difficult to compare text in
a consistent fashion.
This document specifies a framework of processing rules for Unicode
text. Other protocols can create profiles of these rules; these
profiles will allow users to enter internationalized text strings in
applications and have the highest chance of getting the content of
the strings correct. In this case, "correct" means that if two
different people enter what they think is the same string into two
different input mechanisms, the strings should match on a character-
by-character basis.
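As an illustration of what character-by-character matching means here,
the following Python sketch compares two entries of the same visible
string that differ only in how the accented letter was typed. It is a
simplification under stated assumptions: the function name `prepare`
is hypothetical, and `str.casefold()` plus NFKC from the standard
`unicodedata` module stand in for the mapping and normalization steps
that a real profile would take from the tables in this document.

```python
import unicodedata


def prepare(text: str) -> str:
    """Hypothetical stand-in for a profile: case-fold, then apply NFKC."""
    return unicodedata.normalize("NFKC", text.casefold())


composed = "Caf\u00e9"      # e-acute entered as one precomposed code point
decomposed = "Cafe\u0301"   # "e" followed by a combining acute accent

# The raw inputs differ code point by code point...
raw_match = composed == decomposed

# ...but the prepared forms are identical.
prepared_match = prepare(composed) == prepare(decomposed)
```

Both inputs look the same on screen, yet only the prepared forms
compare equal.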
This framework does not describe how data is transcoded from other
character sets into Unicode. In systems that use non-Unicode
character sets, the transcoding algorithm is a critical part of
enabling secure and "correct" operation of internationalized text
strings.
In addition to helping string matching, profiles of stringprep can
also exclude characters that should not normally appear in text that
is used in the protocol. The profile can prevent such characters by
changing the characters to be excluded to other characters, by
removing those characters, or by causing an error if the characters
would appear in the output. For example, because the backspace
character can cause unpredictable display results, a profile can
specify that a string containing a backspace character would cause an
error.
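The three ways of excluding a character described above can be
sketched in Python as follows. The function names are hypothetical,
and the particular characters chosen for mapping and removal (NO-BREAK
SPACE and SOFT HYPHEN) are illustrative assumptions; only the
backspace case is taken from the example in the text.

```python
BACKSPACE = "\u0008"
NO_BREAK_SPACE = "\u00a0"
SOFT_HYPHEN = "\u00ad"


def exclude_by_mapping(text: str) -> str:
    """Change an excluded character to another character (assumed choice)."""
    return text.replace(NO_BREAK_SPACE, " ")


def exclude_by_removal(text: str) -> str:
    """Remove an excluded character entirely (assumed choice)."""
    return text.replace(SOFT_HYPHEN, "")


def exclude_by_error(text: str) -> str:
    """Reject any string that contains a backspace character."""
    if BACKSPACE in text:
        raise ValueError("string contains a backspace character (U+0008)")
    return text
```

Which of the three strategies applies to which characters is a
decision made by each profile, not by this sketch.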
A profile of stringprep converts a single string of input characters
to a string of output characters, or returns an error if the output
string would contain a prohibited character. Stringprep profiles
cannot both emit a string and return an error.
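A minimal sketch of this "string or error, never both" behavior, using
the `stringprep` module from the Python standard library, which
exposes the tables of this document as lookup functions. The profile
itself is hypothetical: the name `example_profile`, the exception
class, and the selection of prohibited tables (C.1.2, C.2.1, C.2.2)
are assumptions made for illustration, and bidirectional and
unassigned code point checks are left out.

```python
import stringprep
import unicodedata


class ProhibitedOutput(ValueError):
    """Raised instead of returning a string when output is prohibited."""


# Assumed selection of prohibition tables for this illustrative profile.
PROHIBITED_CHECKS = (
    stringprep.in_table_c12,   # non-ASCII space characters
    stringprep.in_table_c21,   # ASCII control characters
    stringprep.in_table_c22,   # non-ASCII control characters
)


def example_profile(text: str) -> str:
    # Map: drop B.1 characters, case-fold the rest with table B.2.
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in text
        if not stringprep.in_table_b1(ch)
    )
    # Normalize with NFKC, using the Unicode 3.2 data this document targets.
    normalized = unicodedata.ucd_3_2_0.normalize("NFKC", mapped)
    # Prohibit: any hit raises, so no string is returned alongside an error.
    for ch in normalized:
        if any(check(ch) for check in PROHIBITED_CHECKS):
            raise ProhibitedOutput(f"prohibited code point U+{ord(ch):04X}")
    return normalized
```

A caller either receives the prepared string or has to handle
`ProhibitedOutput`; the function never hands back a partially
processed string together with an error.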
Stringprep profiles cannot account for all of the variations that
might occur or that a user might expect. In particular, a profile
will not be able to account for choice of spellings in all languages
for all scripts because the number of alternative spellings of words
and phrases is immense. Users would probably expect all spelling
equivalents to be made equivalent, or none of them to be. Examples
of spelling equivalents include "theater" vs. "theatre", and
"hemoglobin" vs. "h<U+00E6>moglobin" in American vs. British English.
Other examples are simplified Chinese spellings of names (for
example, "<U+7EDF><U+4E00><U+7801>") vs. the equivalent traditional
Chinese spelling (for example, "<U+7D71><U+4E00><U+78BC>").
Language-specific equivalences such as "Aepfel" vs. "<U+00C4>pfel",
which are sometimes considered equivalent in German, may not be
considered equivalent in other languages.
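To make the limitation concrete, the following Python sketch applies
case folding and NFKC normalization to the pairs mentioned above and
checks that each pair still differs afterwards. The helper name
`fold_and_normalize` is hypothetical, and `str.casefold()` with the
standard `unicodedata` module is only an approximation of what a real
profile would do.

```python
import unicodedata


def fold_and_normalize(text: str) -> str:
    """Hypothetical approximation of a profile: case-fold, then NFKC."""
    return unicodedata.normalize("NFKC", text.casefold())


pairs = [
    ("theater", "theatre"),
    ("hemoglobin", "h\u00e6moglobin"),
    ("\u7edf\u4e00\u7801", "\u7d71\u4e00\u78bc"),   # simplified vs. traditional
    ("Aepfel", "\u00c4pfel"),
]

# True for every pair: the spellings remain distinct after preparation.
still_different = [
    fold_and_normalize(a) != fold_and_normalize(b) for a, b in pairs
]
```

Treating any of these pairs as equal would require language-specific
knowledge that lies outside character-level preparation.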
...
This archive was generated by hypermail 2.1.5 : Thu Dec 26 2002 - 04:54:47 EST