Unicode Utilities: Description and Index

Boundaries

  • Breaks: Demonstrates different boundaries within text.
    • Enter the sample text.
    • Pick the kind of boundaries, or hit Test.
  • Regex: Shows the transformation of a (Java) regex pattern to support Unicode.
    • Enter the regex pattern.
    • Change the sample text if desired.
    • Click Show Modified Regex Pattern.
    You'll then see the modified pattern. It will often be much larger than the original, but any reasonable regex engine should compile the expanded character classes without trouble. Below that, you'll see a sample of how the expression works: it is used to find substrings of the sample text, which are then underlined.

Properties

  • Unicode Property Demo window
    • Enter a character code on the right side, and hit Show. You'll see the properties for that character (where they have non-default values).
    • If you click on any property (like Age), you'll see a list of all the properties and their values in the Unicode Property List window.
    • If you click on any property value in either of these two windows, like 4.0.0.0 for Age, you'll see the characters with that property value in the UnicodeSet Demo window.
  • UnicodeSet Demo window
    • You can put in arbitrary UnicodeSets, allowing boolean combinations of any of the property+value combinations in the Unicode Property List window.
    • If you click on Compare at the top, you can compare any two UnicodeSets.

Transforms


UnicodeSet

UnicodeSets use regular-expression syntax to allow for arbitrary set operations (Union, Intersection, Difference) on sets of Unicode characters. The base sets can be specified explicitly, such as [a-m w-z], or using Unicode Properties like [[:script=arabic:]&[:decompositiontype=canonical:]]. The latter set gets the Arabic script characters that have a canonical decomposition. The properties can be specified either with Perl-style notation (\p{script=arabic}) or with POSIX-style notation ([:script=arabic:]). For more information, see ICU UnicodeSet Documentation.
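
The intersection in the example above can be approximated without ICU, as a rough sketch. The code below uses only JDK classes, so it is not the UnicodeSet API: it assumes that "script is Arabic" can be read from Character.UnicodeScript and that "has a canonical decomposition" is equivalent to "NFD changes the character". The class and method names are hypothetical, and the result reflects the Unicode version bundled with the JDK rather than the one used by the demo.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class ArabicCanonicalSketch {
    // Base set 1: code points whose script is Arabic, per the JDK's Unicode tables.
    static boolean isArabic(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC;
    }

    // Base set 2 (assumption): a character has a canonical decomposition
    // exactly when NFD normalization changes it.
    static boolean hasCanonicalDecomposition(int cp) {
        String s = new String(Character.toChars(cp));
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    // Intersection of the two base sets, built by brute-force enumeration.
    static List<Integer> arabicWithCanonicalDecomposition() {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isArabic(cp) && hasCanonicalDecomposition(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> set = arabicWithCanonicalDecomposition();
        System.out.println("Size: " + set.size());
        for (int cp : set) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

Enumerating every code point is slow compared with a real UnicodeSet, but it keeps the set logic visible: each property is a predicate, and "&" is a logical AND of the two.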

In the online demo, the implementation of UnicodeSet is customized in the following ways.

  1. Query Use. The UnicodeSet can be typed in, or used as a URL query parameter, such as the following. Note that in that case, "&" needs to be replaced by "%26".
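
As a minimal sketch of the escaping described in the Query Use item, the following percent-encodes a UnicodeSet expression so that it can be used as a query parameter value. The class and method names are hypothetical, and the demo's actual URL and parameter name are not assumed here. The Charset overload of URLEncoder.encode requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class UnicodeSetQuerySketch {
    // Percent-encodes a UnicodeSet expression for use as a query parameter value.
    // URLEncoder applies form encoding, so it escapes more than just "&"
    // (for example "[" and ":"), which is harmless for this purpose.
    static String encodeSet(String unicodeSet) {
        return URLEncoder.encode(unicodeSet, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String set = "[[:script=arabic:]&[:decompositiontype=canonical:]]";
        System.out.println(encodeSet(set));
    }
}
```
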
  2. Regular Expressions. For the name property, regular expressions can be used for the value, enclosed in /.../. For example in the following expression, the first term will select all those Unicode characters whose names contain "CJK". The rest of the expression will then subtract the ideographic characters, showing that these can be used in arbitrary combinations.

    Some particularly useful regex features are:

    Caveats:

    1. The regex uses the standard Java Pattern. In particular, it does not have the extended functions in UnicodeSet, nor is it up-to-date with the latest Unicode. So be aware that you shouldn't depend on properties inside of the /.../ pattern.
    2. The Unassigned, Surrogate, and Private Use code points are skipped in the Regex comparison, so [:Block=/Aegean_Numbers/:] returns a different number of characters than [:Block=Aegean_Numbers:], because it skips Unassigned code points.
    3. None of the normal "loose matching" is enabled. So [:Block=aegeannumbers:] works, but [:Block=/aegeannumbers/:] fails -- you have to use [:Block=/Aegean_Numbers/:] or [:Block=/(?i)aegean_numbers/:].
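
The third caveat can be reproduced with the Java Pattern class directly. This is only a sketch: it assumes the /.../ value is searched for anywhere inside the property value name (consistent with the "names contain CJK" description above), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class BlockNameRegexSketch {
    // Assumed semantics of a /.../ value: the regex may match anywhere
    // inside the property value name, with no loose matching applied.
    static boolean matchesValue(String regex, String valueName) {
        return Pattern.compile(regex).matcher(valueName).find();
    }

    public static void main(String[] args) {
        String block = "Aegean_Numbers";
        // Loose form: wrong case and no underscore, so a plain regex fails.
        System.out.println(matchesValue("aegeannumbers", block));
        // Exact spelling works.
        System.out.println(matchesValue("Aegean_Numbers", block));
        // The (?i) flag restores case-insensitivity, but the underscore is still required.
        System.out.println(matchesValue("(?i)aegean_numbers", block));
    }
}
```
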
  3. Casing Properties. Unicode defines a number of string casing functions in Section 3.13 Default Case Algorithms. These string functions can also be applied to single characters. Warning: the first three sets may be somewhat misleading: isLowercase means that the character is the same as its lowercase version, which includes all uncased characters. To get those characters that are cased characters and lowercase, use [[:isLowercase:]&[:isCased:]].
    1. The binary testing operations take no argument:
    2. The string functions are also provided, and require an argument. For example:

      Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of the sets.
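
The warning about isLowercase can be illustrated with plain JDK calls. This is a sketch under stated assumptions: "isLowercase" is modelled as "unchanged by String.toLowerCase", and "isCased" is approximated as lowercase, uppercase, or titlecase according to java.lang.Character. The class and method names are hypothetical, and the JDK's tables may differ from those used by the demo.

```java
import java.util.Locale;

final class CasingSketch {
    // Modelled on the description above: true when lowercasing leaves the
    // character unchanged, which also holds for uncased characters like '1'.
    static boolean isLowercase(int cp) {
        String s = new String(Character.toChars(cp));
        return s.toLowerCase(Locale.ROOT).equals(s);
    }

    // Approximation of the Cased property using the JDK's character tests.
    static boolean isCased(int cp) {
        return Character.isLowerCase(cp)
                || Character.isUpperCase(cp)
                || Character.isTitleCase(cp);
    }

    // Counterpart of [[:isLowercase:]&[:isCased:]]: cased and lowercase.
    static boolean isCasedLowercase(int cp) {
        return isLowercase(cp) && isCased(cp);
    }

    public static void main(String[] args) {
        for (int cp : new int[] {'a', 'A', '1'}) {
            System.out.printf("U+%04X isLowercase=%b isCased=%b both=%b%n",
                    cp, isLowercase(cp), isCased(cp), isCasedLowercase(cp));
        }
    }
}
```

The digit '1' passes the first test but fails the second, which is exactly why the intersection is needed to get only lowercase letters.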

  4. Normalization Properties. Unicode defines a number of string normalization functions in UAX #15. These string functions can also be applied to single characters.
    1. The binary testing operations have somewhat odd constructions:
    2. The string functions are also provided, and require an argument. For example:

      Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of the sets.
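
For orientation, the two kinds of normalization operation (a binary test and a string function) can be sketched with java.text.Normalizer. The helper names below are hypothetical and are not the demo's property names; the results follow the Unicode version bundled with the JDK.

```java
import java.text.Normalizer;

final class NormalizationSketch {
    // Binary test: is the string already in NFC?
    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    // String function: canonical decomposition (NFD).
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // String function: compatibility composition (NFKC).
    static String toNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0";   // LATIN SMALL LETTER A WITH GRAVE
        String decomposed = "a\u0300";   // 'a' followed by COMBINING GRAVE ACCENT
        System.out.println(isNfc(precomposed));
        System.out.println(isNfc(decomposed));
        System.out.println(toNfd(precomposed).equals(decomposed));
        System.out.println(toNfkc("\uFF21")); // FULLWIDTH LATIN CAPITAL LETTER A
    }
}
```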

  5. IDNA Properties. The status of characters with respect to IDNA (internationalized domain names) can also be determined. The available properties are listed below.
    1. [:idna=output:] The set of all characters allowed in the output of IDNA. An example is:
      • U+00E0 ( à ) LATIN SMALL LETTER A WITH GRAVE
    2. [:idna=ignored:] The set of all characters ignored by IDNA on input. That is, these characters are mapped to nothing -- removed -- by NamePrep. An example is:
    3. [:idna=remapped:] The set of characters remapped to other characters by IDNA (NamePrep). Examples are:
      • U+00C0 ( À ) LATIN CAPITAL LETTER A WITH GRAVE (remapped to the lowercase version).
      • U+FF21 ( Ａ ) FULLWIDTH LATIN CAPITAL LETTER A
    4. [:idna=disallowed:] These are characters disallowed (on the registry side) by IDNA. An example is:

      Note: The client side adds characters unassigned in Unicode 3.2, for compatibility. To see just the characters disallowed in Unicode 3.2, you can use [[:idna=disallowed:]&[:age=3.2:]]. To also remove private-use, unassigned, surrogates, and controls, use [[:idna=disallowed:]&[:age=3.2:]-[:c:]].
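
The remapping examples above can be checked against the JDK's own IDNA support, as a rough sketch. java.net.IDN implements the NamePrep-based IDNA of RFC 3490, which is assumed here to match the behaviour the properties describe; it is not the code behind the demo, and the class and method names are hypothetical.

```java
import java.net.IDN;

final class IdnaSketch {
    // Converts a label to its ASCII (Punycode) form, applying NamePrep first.
    static String toAscii(String label) {
        return IDN.toASCII(label);
    }

    // Round-trips a label through ASCII form to reveal what it is remapped to.
    static String remappedForm(String label) {
        return IDN.toUnicode(IDN.toASCII(label));
    }

    public static void main(String[] args) {
        // U+00C0 and U+00E0 should yield the same ASCII form,
        // because the capital is remapped to the lowercase letter.
        System.out.println(toAscii("\u00C0"));
        System.out.println(toAscii("\u00E0"));
        // U+FF21 (fullwidth A) is remapped to a plain ASCII letter.
        System.out.println(remappedForm("\uFF21"));
    }
}
```

A character in [:idna=output:] comes back unchanged from the round trip, while a remapped character comes back as its target. IDN.toASCII throws IllegalArgumentException for input it rejects.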


Fonts and Display. If you don't have a good set of Unicode fonts (and a modern browser), you may not be able to read some of the characters. Some suggested fonts that you can add for coverage are: Unicode Fonts for Ancient Scripts; Noto Fonts site; Large, multi-script Unicode fonts. See also: Unicode Display Problems.

Version 3.7; ICU version: 56.0.1.0; Unicode version: 8.0.0.0