L2/08-085
Report of the South Asia Subcommittee Meeting in Chennai, 2008/01/23-24
draft
The following is a report on the South Asia Subcommittee Meeting in Chennai.
Initial discussions
The following are informal notes that summarize the address by the IT Secretary of the Tamil Nadu Government. The
exact words will be attached later.
The government's main concerns are the following:
1. Errors in the encoding, such as non-Tamil characters being included in it
2. Efficiency - the Govt of Tamil Nadu is undertaking a massive e-governance
effort. Huge digital libraries are coming up, and the govt doesn't want to have to migrate these massive
databases in the future. The govt relies on the experts in the task force, and 13 meetings have been held
to review and analyze the TACE encoding.
3. There may be legal issues as well, and the government has to be very careful.
The government's position is that one block for TACE-16 in Unicode would be
desirable; the feasibility and practicality need to be investigated. The Secretary urged the UTC/UC to look at the issues and
suggest ways to resolve them. The problems are genuine. Tamil is an international language,
used in official transactions in Sri Lanka, so there are international ramifications. We have proposed our
solution and would like the UTC's recommendations.
As for the Tamil Nadu Government, it intends to accept the
recommendations of the task force and declare a standard, and it expects the Government of India's
support as well.
- We do appreciate that a lot of work has been done.
- Appreciate everyone's being as frank and forthcoming as possible.
- What are the objections and the stance of the UC?
- Also need to consider the Tamil Community’s stance
- The UC's views are vital; it can suggest better solutions if available
- We would like international community’s opinion to make an informed decision
- Announced creation of a fund to ease the migration from the old to new.
- Some teething problems are expected in moving from old to new
- Migration path can be considered.
The government of TN wants to hold a conference in coordination with INFITT.
Opening remarks from Mark Davis; main points in summary:
- The UC also has the goal of making Tamil work correctly; we look forward to
working with the TNG and the GOI to do this.
- An important issue is stability - Unicode is a bit like the banking system -
the confidence of members and implementers in the stability is key to its success.
- Not breaking existing implementations is vital
Discussion
- General agreement about the need to improve the situation for Tamil users
- Key problem is the lack of implementations of Tamil, and the correctness of
those implementations. There were many examples of the need to improve.
- For the meeting, we'll be following the South Asia subcommittee charter as per
the August UTC meeting, copied below.
Step 1: Identification of the issues
Discussion of TACE-16 - a proposal for 347 new characters replacing the
Tamil block. The following issues were raised as part of that presentation.
- Significant work since the May meeting on identifying the issues, developing concrete tests
- The task force on TACE-16, which comprises the technical advisors to the Government of Tamil Nadu, has
recommended to the government that the All Character Tamil Encoding (TACE-16) be made a government
standard to meet the following requirements:
- Handle the emerging needs of
e-governance,
- Produce unambiguous and legally indisputable digital records of government
documents
- Enable creation of documents that will stand the test of time,
- Be independent of any external shaping engine
- Be efficient in desktop publishing, linguistic and natural language processing
- Assure safe, unambiguous browsing resistant to domain name spoofing
- Issues Raised
- What users think of as characters (syllabic)
- Ambiguous encodings in Unicode (length marks)
- Unicode characters not in collation order
- Simplification of natural-language processing
- Dependence on correct rendering engines
- Fonts not having correct OpenType tables
- Variable levels of support for OpenType fonts
- Efficiency in storage and processing
- See attached report for more details.
- Conclusion by the TACE task force that Unicode does not match users' perceptions of characters, and is less
optimal than TACE-16 for the measured operations.
In discussion, an issue was raised:
- the results need to be reproducible: eg, data made available and at least
pseudo-code for the operations
Discussion of Unicode principles relevant to above (Mark Davis,
Michael Kaplan)
- A Unicode character is a coded entity ≠ what the user thinks of as a "letter" or "character". Many examples from a variety of scripts.
This is true for languages other than Tamil as well (e.g. the Swedish A with ring).
- Canonical equivalence establishes identity; normalization (NFC) is used for unambiguous
representation (specifically, the "broken" vowel pieces are combined). Used in important cases
like IDNA. (A short normalization sketch follows these bullets.)
- Code order ≠ collation order for any language: eg, Z < a
- Display ≠ character codes. Many scripts require more than linear
layout. Some of the errors or inefficiencies may be triggered by problems in the correctness of
implementations, for example in collation, rendering, etc. OpenType fonts with ligature tables
could map all of the 345 or so Tamil characters identified as precomposed glyphs onto existing
Unicode quite efficiently.
- Storage is an issue, but not predominant (discussion of UTFs, history of UTF-16).
(See also #9)
- Stability is a key issue. Unicode is like the banking system: people have to be able to
trust that it won't change out from under them. Major clients of Unicode are very dependent on
this -- some would much rather have stability than improvements.
- Similarity of models helps with implementation. Perceived difficulty of
implementation only increases if the language deviates from a family model and stands by
itself. There is strength in being part of a family model where only slight modifications are
needed to support a new language. (For example, other Indic scripts.)
- There may be minority letters in a script, or "mistaken" characters like
U+0B82 ( ஂ ) TAMIL SIGN ANUSVARA, which is not used in any language using the Tamil script. Characters can be annotated (as Anusvara is), or
deprecated (stronger), but never removed. (There was discussion of options for this character,
as to whether to annotate or deprecate.) Even the name of the character cannot be changed.
(There are separate data files in the Unicode Character Database with name annotations, more
information, and corrections.) Note: localized names for Tamil characters can be supplied
(eg Pulli vs "VIRAMA", or visarga), so that vendors can display the correct name in programs
like CharMap.
- Results of efficiency vary dramatically according to the code used.
Efficiency in storage/transmission is implementation-dependent, and algorithms can be carefully
optimized. Efficiency in processing is a desirable goal, but if stability of implementation
forces a hit on efficiency, that is acceptable. (See also #5)
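To illustrate the canonical-equivalence point above, here is a minimal Python sketch (standard library only) showing that the "broken" vowel pieces for the Tamil two-part vowel O are combined by NFC into the precomposed vowel sign; the code points used are all from the existing Tamil block.

import unicodedata

# U+0BCA TAMIL VOWEL SIGN O is canonically equivalent to <U+0BC6, U+0BBE>.
# NFC yields the single, unambiguous representation used in contexts like IDNA.
ka = "\u0B95"                      # TAMIL LETTER KA
split = ka + "\u0BC6" + "\u0BBE"   # KA + VOWEL SIGN E + VOWEL SIGN AA ("broken" pieces)
precomposed = ka + "\u0BCA"        # KA + VOWEL SIGN O

assert unicodedata.normalize("NFC", split) == precomposed
assert unicodedata.normalize("NFD", precomposed) == split
print("canonically equivalent:",
      unicodedata.normalize("NFC", split) == unicodedata.normalize("NFC", precomposed))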
Notes:
SMP vs BMP
- BMP is from 0000 to FFFF. Most common characters, widely supported
- The SMP is from 10000 to 1FFFF (the supplementary planes as a whole run through 10FFFF). Infrequent characters,
historic scripts. Support in major OSs began a few years ago, but many applications don't fully
support it. (Example: Vista supports planes 1 and 2; only fonts for plane 2.)
- BMP code points are typically transmitted as 3 UTF-8 octets, while SMP code points
require 4. In UTF-16, these are 2 bytes for BMP characters and 4 for SMP characters. (The
difference is not double as might be expected.) A byte-count sketch follows these notes.
- Space in the SMP is not constrained, whereas space in the BMP is very
confined at this point. In particular, certain areas are reserved for right-to-left characters,
which cannot be changed without serious consequences.
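As a quick check of the byte counts above, the following Python sketch encodes one BMP Tamil letter and one SMP code point (U+10000 is used purely as an SMP example, not as any proposed TACE-16 assignment):

bmp_char = "\u0B95"      # TAMIL LETTER KA, in the BMP
smp_char = "\U00010000"  # LINEAR B SYLLABLE B008 A, in the SMP

for label, ch in (("BMP", bmp_char), ("SMP", smp_char)):
    print(label,
          "UTF-8:", len(ch.encode("utf-8")), "bytes;",      # 3 vs 4
          "UTF-16:", len(ch.encode("utf-16-le")), "bytes")   # 2 vs 4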
Discussion
- Members of the TACE taskforce disputed the points about the
efficiency/performance issues, and benefits of following the Indic model.
- At the time that Tamil was first encoded, it could have followed a syllabic
model for encoding like Ethiopic has now.
- Implementations quite often may transform Unicode into different internal
formats for processing, such as in doing natural-language processing.
- If TACE were in the SMP, some problems are avoided -- the main blocker is
dual encoding and stability.
- Normalization cannot map old characters to new characters, because of stability
constraints. If a new precomposed character were added, it would be normalized back to its
components.
- Unicode operating systems (Windows, Mac, etc) convert to Unicode for
rendering, etc.
Step 2: Evaluation of possible approaches
We started from the bottom up:
Approach D: TACE-16 as a separate IANA-registered character set
Unicode programs would convert on input to Unicode, process, and emit TACE-16 on output.
(Similar to GB 18030.) Non-Unicode programs could process it natively. (A conversion sketch appears at the end of this approach's notes.)
Pros
- No dependency on Unicode - Tamil Nadu government can do
independently
- TACE16 is very easy to implement; not stateful, easy conversion to and from Unicode
- well-established path for charsets - implementations are used to using them
- governments have strong sway
- the Tamil Nadu government can do exactly what it wants
- useful in any closed environment: examples: cell phone, natural-language processing,
etc.
- well-defined path for programs to support -- programs are used to doing conversion
- if multilingual capabilities are required inside the same codepage, then additional repertoire
would need to be added, eg, for English, French, Telugu, Malayalam, Sinhala, etc.
- Example: GB 18030 (China) includes all Unicode characters, with an
algorithmic mapping to Unicode for most characters.
- The simpler the mapping to Unicode, the more likely implementations would
pick it up.
- See iana.org for the list of IANA charsets.
- Other TACE advantages: eg Processing using syllables (eg NLP) would use
single code points.
- On Unicode system, where conversion is done, algorithms depending on Unicode
properties would work: line-breaking, sorting, identifiers, etc.
Cons
- whether it is added to products depends on companies adding the conversion tables.
- for cell-phone environments, 8-bit encoding may be preferred
- uptake by companies will depend on critical mass, so a bit of a chicken and egg problem
- performance issues need investigation
- Typically Unicode programs / OSs will convert to Unicode for rendering, etc.
(Linux may not -- needs investigation.) However, typically performance is not substantially
impacted for rendering.
- Would need to evangelize key players
- ICU, Windows, Java, PHP, Python, Perl, Linux,...
- Many will pick up without further evangelization
- Most are combination of data+algorithms
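The following Python sketch shows the rough shape a TACE-16 <-> Unicode converter under Approach D could take. It is only an assumption-laden illustration: the actual TACE-16 code unit assignments are not given in this report, so the mapping table below is an empty placeholder to be filled from the published standard, and the big-endian byte order and pass-through behaviour are assumptions.

# Placeholder table: real entries would map each 16-bit TACE-16 code unit to the
# corresponding Unicode character or sequence (e.g. consonant + vowel sign).
TACE16_TO_UNICODE = {}
UNICODE_TO_TACE16 = {v: k for k, v in TACE16_TO_UNICODE.items()}

def tace16_decode(data: bytes) -> str:
    """Decode big-endian 16-bit TACE-16 code units into standard Unicode text."""
    out = []
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        out.append(TACE16_TO_UNICODE.get(unit, chr(unit)))  # pass through unmapped units
    return "".join(out)

def tace16_encode(text: str) -> bytes:
    """Encode Unicode Tamil text as 16-bit TACE-16 code units.

    TACE-16 encodes whole syllables, so the encoder tries the longest Unicode
    sequence (e.g. consonant + vowel sign) against the table first.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        for length in (3, 2, 1):                       # longest match first
            unit = UNICODE_TO_TACE16.get(text[i:i + length])
            if unit is not None:
                out += unit.to_bytes(2, "big")
                i += length
                break
        else:
            out += ord(text[i]).to_bytes(2, "big")     # pass-through (BMP input assumed)
            i += 1
    return bytes(out)

As with GB 18030, the simpler and more regular the mapping, the easier it is for vendors to add such a converter to their existing conversion libraries.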
Approach C: TACE-16 repertoire in the PUA
Pros
- No dependency on Unicode - Tamil Nadu government can do
independently
- Encapsulated in Unicode, so no conversion necessary
- SMP PUA is unencumbered - TACE-16 group could establish precedent (homesteading)
- Compression of SMP works well
- Rendering would be straightforward.
- Other TACE advantages: eg Processing using syllables (eg NLP) would use
single code points.
Cons
- BMP PUA is in wide use for ideographs already, so it probably wouldn't be
practical. (needs investigation, there might be enough room)
- Overlap problem - some others could use code points for different purpose
- Many implementations, and all old implementations, will treat these as unknown characters
(impacting anything dependent on properties: line-breaking, sorting, identifiers, etc). There are no standard Unicode properties for them, so property-driven algorithms won't work. (See the property-lookup sketch after these cons.)
- Conversions are needed for interfacing with standards that require standard
Unicode. For example, IDNs will be in standard Unicode, requiring a conversion.
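The "unknown character" problem can be seen directly from the Unicode Character Database: PUA code points carry only the generic category Co and no character names, so property-driven algorithms have nothing to work with. A small Python illustration (U+F0000 is just an arbitrary plane-15 PUA code point, not a proposed TACE assignment):

import unicodedata

tamil_ka = "\u0B95"        # standard Tamil letter, full properties available
pua_char = "\U000F0000"    # arbitrary supplementary-plane PUA code point

print(unicodedata.name(tamil_ka), unicodedata.category(tamil_ka))  # TAMIL LETTER KA Lo
print(unicodedata.category(pua_char))                              # Co (private use)
print(unicodedata.name(pua_char, "<no standard name>"))            # no name assigned
print(tamil_ka.isidentifier(), pua_char.isidentifier())            # True False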
Discussion:
- legal implications of PUA:
- If the Tamil Nadu government established a standard, then being a
standard for legal purposes is not an issue.
- For legal purposes, people need to use final-form documents
with embedded fonts, for any language.
- Font issues are not specific to the PUA - font spoofing is possible either way.
Approach CD: Approach C, plus register it with IANA
as a charset.
- Mixture of advantages and disadvantages of above.
- Examples:
- No dependency on Unicode - Tamil Nadu government can do
independently
- In some cases, TACE would convert to and from Unicode; in others it could be
interpreted natively.
- Character properties would be available; all multilingual capabilities would
be present;
- IANA pros and cons from D.
Approach B: TACE-16 repertoire added to Unicode
The TACE-16 task force investigated different approaches (listed above, with the full report
attached). The major choices are BMP vs
SMP. The TACE task force would like to see TACE in the BMP; failing that, the SMP would be an
acceptable backup.
Pros
- See attached document
Cons
- Unless the current Unicode model can be shown to be unable to represent Tamil, the duplicate
encoding and stability principles would prevent addition.
- Accommodating TACE in the BMP would require moving the reserved RTL (U+0800
.. U+08FF) code point range. (Space is not an issue for the SMP.)
- The suggestion from the TACE group is to move the scripts intended for the reserved RTL area elsewhere:
- Arabic extensions to U+18B0 .. U+18FF
- Mandaic to U+A8E0 .. U+A8FF
- Samaritan to U+AB50 .. U+AB7F
- Sorang Sompeng to U+A4D0 .. U+A4FF
Approach B1: Add only "pure consonants" to Unicode
This would be adding what is currently represented as <consonant + pulli> as
precomposed characters to the current Tamil block in Unicode.
Pros
- Pure consonants represent 30% of the letter frequency in Tamil text
- Possible performance benefits in collation, text size (for unnormalized text)
Cons
- Would be introducing new precomposed characters
- Normalization would replace the new characters with the current ones. (See the sketch following these cons.)
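The normalization point can be illustrated with an existing analogue. Precomposed characters that are composition-excluded are turned back into their parts by NFC, and under the normalization stability policy any newly added precomposed <consonant + pulli> characters would be excluded in the same way. A minimal Python sketch using U+0958 DEVANAGARI LETTER QA, an existing excluded precomposed letter:

import unicodedata

# U+0958 canonically decomposes to <U+0915, U+093C> and is composition-excluded,
# so even NFC leaves it decomposed; new Tamil precomposed characters would behave alike.
assert unicodedata.normalize("NFD", "\u0958") == "\u0915\u093C"
assert unicodedata.normalize("NFC", "\u0958") == "\u0915\u093C"
print([hex(ord(c)) for c in unicodedata.normalize("NFC", "\u0958")])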
Key Areas where governments, industry, and Unicode can help
There is a natural frustration with programs not being able to handle Tamil, or having errors.
Discussed common techniques companies use in prioritizing their work on different languages, and how
to leverage improvements.
No matter what approach is taken, there is a common need for the following
(draft list):
- Identify problems in key application programs and set up communication with vendors
- Core set of open source (individual and commercial use) high-quality fonts
- Freely available keyboard specifications and IMEs
- Central place for developers to go for help with Tamil (on Unicode site or Government site,
perhaps wiki?)
- Up-to-date locale data (eg CLDR)
- Need to investigate having a standard ligature table for OpenType to map
Unicode sequences to TACE syllables. (A generation sketch follows this list.)
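As a starting point for that investigation, the Python sketch below generates OpenType feature-file substitution rules for every <consonant, vowel sign> pair in the current Tamil block. The glyph names are hypothetical placeholders, and the full TACE syllable inventory (pure consonants, independent vowels, etc.) is larger than the 253 pairs produced here; a real table would use the font's actual glyph names and the agreed syllable set.

import unicodedata

CONSONANTS = [0x0B95, 0x0B99, 0x0B9A, 0x0B9C, 0x0B9E, 0x0B9F, 0x0BA3, 0x0BA4,
              0x0BA8, 0x0BA9, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0, 0x0BB1, 0x0BB2,
              0x0BB3, 0x0BB4, 0x0BB5, 0x0BB6, 0x0BB7, 0x0BB8, 0x0BB9]
VOWEL_SIGNS = [0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2, 0x0BC6, 0x0BC7, 0x0BC8,
               0x0BCA, 0x0BCB, 0x0BCC]

def glyph_name(cp: int) -> str:
    """Derive a placeholder glyph name from the Unicode character name."""
    return "ta" + unicodedata.name(chr(cp)).split()[-1].title()

rules = []
for c in CONSONANTS:
    for v in VOWEL_SIGNS:
        ligature = glyph_name(c) + glyph_name(v)        # hypothetical syllable glyph
        rules.append(f"sub {glyph_name(c)} {glyph_name(v)} by {ligature};")

print(len(rules), "GSUB ligature rules")                # 23 x 11 = 253
print(rules[0])                                         # e.g. sub taKa taAa by taKataAa;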
Side issue: the Tamil numbers are almost archaic, and offer opportunities
for spoofing, so are discouraged for identifiers such as IDN.
Discussion of Unicode Locales Project (CLDR)
- (not able to do for lack of time)
We wish to thank our hosts, the Tamil Virtual University and Government of Tamil Nadu
South Asia Charter for Tamil Discussion (L2/07-272, item 10)
Goal: ensure that Unicode meets the needs for representation and processing of Tamil.
This may or may not require the encoding of new characters. Any recommendation should exhaustively
examine the implications, including on existing data, on existing software (processing, display,
etc), on education about the standard, on consistency of model for the Indic and other South Asian
scripts.
The scope of the subcommittee is to review the issues and to make recommendations to the UTC.
Step 1: Identification of the issues
Identify the issues (problems or perceived problems) with the current representation. Determine
whether they are issues with the standard itself (encoding, properties, or algorithms) or with
implementations. Determine the nature of the issues: technical, perceptual or educational.
Candidate issues:
1 disconnect of the code chart with the user expectations
2 efficiency in storage/transmission
3 efficiency in processing
4 correctness of implementations
5 difficulty of implementation
Step 2: Evaluation of possible approaches
This enumeration of possible approaches does not preclude the examination of other approaches (which
may extend on or combine the approaches below). The questions listed for each approach are
illustrative of the kinds of questions that need to be answered for a proper evaluation of the
approach; they are not exhaustive.
Approach A: current model
How would those issues be addressed with the current representation? Are there any enhancements (new
characters, changes to properties, addition of properties, guidelines, documentation in the
standard) that would alleviate those issues?
Approach B: TACE-16 repertoire added to Unicode
How would adding the TACE-16 repertoire to Unicode address those issues? And what would be the new
problems created by the introduction of that repertoire?
For example:
• dual encoding and stability policy
• does it need to be in the BMP, and if so, how does it fit there?
• would encoding in a non-contiguous area help or hurt compression techniques?
Approach C: TACE-16 repertoire in the PUA
What are the issues that applications are faced with?
For example:
• collisions with other well-established PUA uses, such as CJK:
- there is not always an "official" mapping, different vendors do different things
- PUA conflicts:
HKSCS 9571 (U+2721B) → U+E78D
GB18030 A6D9 (,) → U+E78D
- PUA differentiation:
HKSCS 8BFA (U+20087) → U+F572
GB18030 FE51 (U+20087) → U+E816
• PUA characters cannot be used in IDN.
Approach D: TACE-16 as a separate IANA-registered character set
How simple is it to add support for a new character set (with a well-defined mapping to the existing
Tamil block) to existing Unicode-based applications? Can this be done in a timely manner, across
enough products to achieve viable workflows? What are the implications for already shipped software?
U+0B82 ( ஂ ) TAMIL SIGN ANUSVARA
U+0B83 ( ஃ ) TAMIL SIGN VISARGA
U+0B85 ( அ ) TAMIL LETTER A
U+0B86 ( ஆ ) TAMIL LETTER AA
U+0B87 ( இ ) TAMIL LETTER I
U+0B88 ( ஈ ) TAMIL LETTER II
U+0B89 ( உ ) TAMIL LETTER U
U+0B8A ( ஊ ) TAMIL LETTER UU
U+0B8E ( எ ) TAMIL LETTER E
U+0B8F ( ஏ ) TAMIL LETTER EE
U+0B90 ( ஐ ) TAMIL LETTER AI
U+0B92 ( ஒ ) TAMIL LETTER O
U+0B93 ( ஓ ) TAMIL LETTER OO
U+0B94 ( ஔ ) TAMIL LETTER AU
U+0B95 ( க ) TAMIL LETTER KA
U+0B99 ( ங ) TAMIL LETTER NGA
U+0B9A ( ச ) TAMIL LETTER CA
U+0B9C ( ஜ ) TAMIL LETTER JA
U+0B9E ( ஞ ) TAMIL LETTER NYA
U+0B9F ( ட ) TAMIL LETTER TTA
U+0BA3 ( ண ) TAMIL LETTER NNA
U+0BA4 ( த ) TAMIL LETTER TA
U+0BA8 ( ந ) TAMIL LETTER NA
U+0BA9 ( ன ) TAMIL LETTER NNNA
U+0BAA ( ப ) TAMIL LETTER PA
U+0BAE ( ம ) TAMIL LETTER MA
U+0BAF ( ய ) TAMIL LETTER YA
U+0BB0 ( ர ) TAMIL LETTER RA
U+0BB1 ( ற ) TAMIL LETTER RRA
U+0BB2 ( ல ) TAMIL LETTER LA
U+0BB3 ( ள ) TAMIL LETTER LLA
U+0BB4 ( ழ ) TAMIL LETTER LLLA
U+0BB5 ( வ ) TAMIL LETTER VA
U+0BB6 ( ஶ ) TAMIL LETTER SHA
U+0BB7 ( ஷ ) TAMIL LETTER SSA
U+0BB8 ( ஸ ) TAMIL LETTER SA
U+0BB9 ( ஹ ) TAMIL LETTER HA
U+0BBE ( ா ) TAMIL VOWEL SIGN AA
U+0BBF ( ி ) TAMIL VOWEL SIGN I
U+0BC0 ( ீ ) TAMIL VOWEL SIGN II
U+0BC1 ( ு ) TAMIL VOWEL SIGN U
U+0BC2 ( ூ ) TAMIL VOWEL SIGN UU
U+0BC6 ( ெ ) TAMIL VOWEL SIGN E
U+0BC7 ( ே ) TAMIL VOWEL SIGN EE
U+0BC8 ( ை ) TAMIL VOWEL SIGN AI
U+0BCA ( ொ ) TAMIL VOWEL SIGN O
U+0BCB ( ோ ) TAMIL VOWEL SIGN OO
U+0BCC ( ௌ ) TAMIL VOWEL SIGN AU
U+0BCD ( ் ) TAMIL SIGN VIRAMA
U+0BD7 ( ௗ ) TAMIL AU LENGTH MARK
U+0BE6 ( ௦ ) TAMIL DIGIT ZERO
U+0BE7 ( ௧ ) TAMIL DIGIT ONE
U+0BE8 ( ௨ ) TAMIL DIGIT TWO
U+0BE9 ( ௩ ) TAMIL DIGIT THREE
U+0BEA ( ௪ ) TAMIL DIGIT FOUR
U+0BEB ( ௫ ) TAMIL DIGIT FIVE
U+0BEC ( ௬ ) TAMIL DIGIT SIX
U+0BED ( ௭ ) TAMIL DIGIT SEVEN
U+0BEE ( ௮ ) TAMIL DIGIT EIGHT
U+0BEF ( ௯ ) TAMIL DIGIT NINE
U+0BF0 ( ௰ ) TAMIL NUMBER TEN
U+0BF1 ( ௱ ) TAMIL NUMBER ONE HUNDRED
U+0BF2 ( ௲ ) TAMIL NUMBER ONE THOUSAND
U+0BF3 ( ௳ ) TAMIL DAY SIGN
U+0BF4 ( ௴ ) TAMIL MONTH SIGN
U+0BF5 ( ௵ ) TAMIL YEAR SIGN
U+0BF6 ( ௶ ) TAMIL DEBIT SIGN
U+0BF7 ( ௷ ) TAMIL CREDIT SIGN
U+0BF8 ( ௸ ) TAMIL AS ABOVE SIGN
U+0BF9 ( ௹ ) TAMIL RUPEE SIGN
U+0BFA ( ௺ ) TAMIL NUMBER SIGN