L2/12-283
Source: Mark Davis
Subject: Handling fake Gershayim and Geresh in Hebrew words (UAX #29)
Date: 2012/07/29
Proposed Change
Create a PRI for the following proposed change to UAX #29 in 6.2.1.
Accommodate the use of " and ' in default Hebrew word break. The changes
would consist of the following:
1. Create property values for Hebrew_Letter (HLetter), Single_Quote
(SQuote), and Double_Quote (DQuote).
2. Add rules:
- HLetter × SQuote
- HLetter × DQuote HLetter
- HLetter (SQuote | DQuote) × HLetter
3. Change every other rule as follows:
- ALetter to be (ALetter | HLetter)
- Mid_Num_Let to be (Mid_Num_Let | SQuote)
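The following is a minimal Python sketch of how the rules above could behave, not an implementation of UAX #29. It assumes a much-reduced rule set: only the three new rules, plus the existing "letter × letter" and "letter × mid-letter letter" rules restricted to SQuote. The names char_class and split_words are hypothetical, and HLetter is approximated by the basic Hebrew letters U+05D0..U+05EA rather than the full property.

```python
LETTERS = {"ALetter", "HLetter"}

def char_class(ch):
    """Assign a simplified word-break class (an approximation, not the real property)."""
    if ch == "'":
        return "SQuote"
    if ch == '"':
        return "DQuote"
    if "\u05D0" <= ch <= "\u05EA":  # assumption: basic Hebrew letters only
        return "HLetter"
    if ch.isalpha():
        return "ALetter"
    return "Other"

def split_words(text):
    """Split text at every position where none of the sketched rules forbids a break."""
    classes = [char_class(c) for c in text]

    def no_break(i):
        # Decide the boundary between text[i-1] and text[i].
        prev, cur = classes[i - 1], classes[i]
        nxt = classes[i + 1] if i + 1 < len(classes) else None
        prev2 = classes[i - 2] if i >= 2 else None
        if prev in LETTERS and cur in LETTERS:
            return True  # (ALetter | HLetter) × (ALetter | HLetter)
        if prev == "HLetter" and cur == "SQuote":
            return True  # HLetter × SQuote
        if prev == "HLetter" and cur == "DQuote" and nxt == "HLetter":
            return True  # HLetter × DQuote HLetter
        if prev2 == "HLetter" and prev in {"SQuote", "DQuote"} and cur == "HLetter":
            return True  # HLetter (SQuote | DQuote) × HLetter
        if prev in LETTERS and cur == "SQuote" and nxt in LETTERS:
            return True  # letter × SQuote letter (SQuote acting as Mid_Num_Let)
        if prev2 in LETTERS and prev == "SQuote" and cur in LETTERS:
            return True  # letter SQuote × letter
        return False

    segments, start = [], 0
    for i in range(1, len(text)):
        if not no_break(i):
            segments.append(text[start:i])
            start = i
    if text:
        segments.append(text[start:])
    return segments

# Hebrew letters are written as escapes to avoid bidi confusion in the source.
print(split_words('the \u05E6\u05D4"\u05DC unit'))  # Hebrew word with medial "
print(split_words("\u05D2'"))                       # Hebrew letter with final '
print(split_words('a"b'))                           # non-Hebrew: " still breaks
```

Under these assumptions, a " between Hebrew letters and a ' after a Hebrew letter stay inside the word, while " between non-Hebrew letters breaks on both sides as before.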
Background
When writing Hebrew, it is common practice to use ASCII " and ' instead of
the correct characters, Gershayim (U+05F4) and Geresh (U+05F3). Those ASCII
characters behave correctly in the default Unicode line break, but not in
the default Unicode word break. The problem arises when Hebrew text is
embedded in text of another language, so that the other language's word
break is used.
There are pros and cons to this change. It is a very language-specific
change, and we certainly don't want to push all language-specific changes
down to root. On the other hand, apart from some minor additional
complexity, it shouldn't hurt any other locale; the script makes this
unambiguous. So we'd like a PRI item for this, to consider whether or not
the change is warranted.
The problem arises in these two cases:
The following case, by contrast, already works and needs no change.
The Geresh-equivalent (') can occur medially and finally, while
the Gershayim-equivalent (") can occur only medially.
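The positional constraint can be illustrated with a small check. This is only a sketch: quote_role and is_hebrew_letter are hypothetical helpers, and "Hebrew letter" is again approximated by U+05D0..U+05EA.

```python
def is_hebrew_letter(ch):
    # Assumption: basic Hebrew letters only (U+05D0..U+05EA).
    return "\u05D0" <= ch <= "\u05EA"

def quote_role(text, i):
    """Return 'geresh', 'gershayim', or None for the character at index i."""
    ch = text[i]
    before = i > 0 and is_hebrew_letter(text[i - 1])
    after = i + 1 < len(text) and is_hebrew_letter(text[i + 1])
    if ch == "'" and before:
        return "geresh"      # allowed medially and finally
    if ch == '"' and before and after:
        return "gershayim"   # allowed only medially
    return None

print(quote_role("\u05D2'", 1))              # ' after a Hebrew letter, word-final
print(quote_role('\u05E6\u05D4"\u05DC', 2))  # " between Hebrew letters
print(quote_role('\u05E6\u05D4"', 2))        # " at the end of a Hebrew word
```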