Technical Reports |
Version | 1.0 |
Editors | Mark Davis |
Date | 2024-11-13 |
This Version | https://www.unicode.org/reports/tr58/tr58-1.html |
Previous Version | none |
Latest Version | https://www.unicode.org/reports/tr58/ |
Latest Proposed Update | https://www.unicode.org/reports/tr58/proposed.html |
Revision | 1 |
This document specifies a standard mechanism for detecting URLs embedded in plain text — in particular, detecting URLs containing non-ASCII characters. It also defines the minimally necessary escaping of non-ASCII code points in the Path, Query, and Fragment portions of a URL that aligns with the mechanism for detecting URLs.
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For more information see About Unicode Technical Reports and the Specifications FAQ. Unicode Technical Reports are governed by the Unicode Terms of Use.
With most email programs, when someone pastes in the plain text:
and sends to someone else, they receive it as:
URLs are also “linkified” in many other applications, such as when pasting into a word processor (triggered by typing a space afterwards, for example). However, many products (many text messaging apps, video messaging chats, etc.) completely fail to recognize any non-ASCII characters past the domain name. And even among those that do recognize such non-ASCII characters, there are gratuitous differences in where they stop linkifying.
Linkification is the process of adding links to URLs in plain text input, such as in emails, text messaging, or video meeting chats. The first step in this process is link detection, which is determining the boundaries of spans of text that contain URLs. That substring can then have a link applied to it in output text. The functions that perform these operations are called a linkifier and link detector, respectively.
The specifications for a URL don’t specify how to handle link detection, since they are only concerned with the structure in isolation, not when it is embedded within flowing text. The lack of a clear specification for link detection also causes many implementations to overuse percent escaping for non-ASCII characters when converting URLs into plain text.
Notes
- Following WhatWG URL: Goals, this specification uses the term URL broadly, as including unescaped non-ASCII characters; that is, as utilizing the formal definition of IRIs. See also the W3C's An Introduction to Multilingual Web Addresses.
- In examples, links will be shown with a background color, to make the extent of the linkification clear.
The linkification process for URLs is already fragmented, with different implementations producing very different results, and that fragmentation is amplified by the addition of non-ASCII characters, which often have very different behavior. That is, developers’ lack of familiarity with the behavior of non-ASCII characters has caused the different implementations of linkification to splinter. Yet non-ASCII characters are very important for readability. People do not want to see the above URL expressed in escaped ASCII:
For example, take the lists of links on List of articles every Wikipedia should have in the available languages. When those are tested with major products, there are significant differences: any two implementations are likely to linkify those differently, such as terminating the linkification at different places, or not linkifying at all. That makes it very difficult to exchange URLs between products within plaintext, which is done surprisingly often — definitely causing problems for implementations that need predictable behavior.
This inconsistency causes problems for users and software companies. Having consistent rules for linkification also has additional benefits, leading to solutions for the following reported problems:
There are many use cases for reducing the % encoding in URLs. For example, consider the common practice of providing user handles in URLs such as:
The first three of these work well in practice. Copying from the address bar and pasting into text provides a readable result. However, with non-ASCII handles (that is, for the majority of the world's population), copying and pasting results in the unreadable https://www.youtube.com/@%E0%A6%AC%E0%A6%B0%E0%A6%BF%E0%A6%B6%E0%A6%BE%E0%A6%87%E0%A6%B2%E0%A7%8D%E0%A6%B2%E0%A6%BE%E0%A6%B9_%E0%A6%AE%E0%A6%A8%E0%A7%81.
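As a small illustration of the readability gap, the following Python sketch decodes the percent-escaped handle URL quoted above and then re-escapes it. It uses only the standard library's `urllib.parse`; the variable names are illustrative, and the choice of `safe=":/@"` is an assumption made so that the round trip reproduces the original string.

```python
from urllib.parse import quote, unquote

# The escaped URL quoted in the text above.
escaped = (
    "https://www.youtube.com/@"
    "%E0%A6%AC%E0%A6%B0%E0%A6%BF%E0%A6%B6%E0%A6%BE%E0%A6%87"
    "%E0%A6%B2%E0%A7%8D%E0%A6%B2%E0%A6%BE%E0%A6%B9"
    "_%E0%A6%AE%E0%A6%A8%E0%A7%81"
)

# Decoding the UTF-8 percent escapes yields the readable handle.
readable = unquote(escaped)

# Escaping every non-ASCII code point again reproduces the long form.
round_trip = quote(readable, safe=":/@")

print(readable)
print(len(escaped), "characters escaped versus", len(readable), "readable")
```

The readable form is what a user sees in the address bar; the escaped form is what many applications currently emit when the URL is copied into plain text.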
If linkification behavior becomes more predictable across platforms and applications, applications will be able to do minimal escaping. For example, in the following only one character would need escaping, the %29 — representing an unmatched “)”:
Providing a consistent, predictable solution that works well across the world’s languages requires a standardized algorithm to define the behavior, and the corresponding Unicode character properties covering all Unicode characters.
Review Note: This draft has not been copy-edited; that is done in later drafts. The Table of Contents will be fleshed out at that point also.
UTS58-C1. For a given version of Unicode, a conformant implementation shall replicate the same link detection results as those produced by Section 3, Link Detection Algorithm.
UTS58-C2. For a given version of Unicode, a conformant implementation shall replicate the same minimal escaping results as those produced by Section 4, Minimal Escaping.
The following table shows the relevant parts of a URL. For clarity, the separator characters are included in the examples. For more information see WhatWG's URL: Example URL Components.
Protocol | Host (incl. Domain) | Port | Path | Query | Fragment |
---|---|---|---|---|---|
https:// | docs.foobar.com | :8000 | /knowledge/area/ | ?name=article&topic=seo | #top |
Note that the Protocol, Port, Path, Query, and Fragment are each optional.
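The split shown in the table can be approximated with Python's standard `urllib.parse.urlsplit`. This is only a sketch: `urlsplit` follows the generic RFC-style syntax rather than the WhatWG parser referenced here, and the dictionary keys are illustrative names chosen to mirror the table columns.

```python
from urllib.parse import urlsplit

url = "https://docs.foobar.com:8000/knowledge/area/?name=article&topic=seo#top"
parts = urlsplit(url)

# Re-attach the separator characters so the values match the table cells.
components = {
    "protocol": parts.scheme + "://",
    "host": parts.hostname,
    "port": f":{parts.port}",
    "path": parts.path,
    "query": "?" + parts.query,
    "fragment": "#" + parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value}")
```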
There are two main processes involved in Unicode link detection.
The start of a URL is easy to determine when it has a known protocol (e.g., “https://”).
Implementations have also developed heuristics for determining the start of the URL when the protocol is elided, taking advantage of the fact that there are relatively few top-level domains. Those techniques can be easily applied to internationalized domain names, which still have strong limitations on the valid characters, so the end of the domain name is also relatively easy to determine. For more information, see UTS #46, Unicode IDNA Compatibility Processing.
The parsing up to the path, query, or fragment is as specified in WhatWG URL: 4.4. URL parsing.
For example, implementations must terminate link detection if a forbidden host code point is encountered, or if the host is a domain and a forbidden domain code point is encountered. Implementations must not linkify if a domain is not a registrable domain. The terms forbidden host code point, forbidden domain code point, and registrable domain are defined in WhatWG URL: Host representation.
For example, an implementation would parse to the end of microsoft.com, google.de, foo.рф, or xn--j1ay.xn--p1ai.
Termination is much more challenging, because of the presence of characters from many different writing systems. While small, hard-coded sets of characters suffice for an ASCII implementation, there are over 150,000 Unicode characters, many with quite different behavior than ASCII. While in theory almost any Unicode character can occur in certain fields in a URL, in practice many characters have very restricted usage in URLs.
Initiation stops at any Path, Query, or Fragment, so the termination process takes over with a “/”, “?”, or “#” character. Each Path, Query, or Fragment can contain most Unicode characters. The key is to be able to determine, given a Part (such as a Query), when a sequence of characters should cause termination of the link detection, even though those characters would be valid in the URL specification.
It is impossible for a link detection algorithm to match user expectations in all circumstances, given the variation in usage of various characters both within and across languages. So the goal is to cover use cases as broadly as possible, recognizing that the results will sometimes not match user expectations. Exceptional cases (URLs that need to use characters that would terminate) can still be appropriately linkified if those few characters are represented with % escapes.
At a high level, this specification defines three features:
One of the goals is also predictability; it should be relatively easy for users to understand the link detection behavior at a high level.
This specification defines two properties: Link_Termination (LTerm) and Link_Paired_Opener (LOpener).
Link_Termination is an enumerated property of characters with five enumerated values: {Include, Hard, Soft, Close, Open}
Value | Description / Examples |
---|---|
Include | There is no stop before the character; it is included in the link. Example: letters |
Hard | The URL terminates before this character. Example: a space |
Soft | The URL terminates before this character, if it is followed by /\p{lt=Soft}*(\p{lt=Hard}\|$)/. Example: a question mark |
Close | If the character is paired with a previous character in the same part (path, query, fragment), it is treated as Include. Otherwise it is treated as Hard. [Review Note: for paths, should this be limited to the same segment between '/' characters?] Example: an end parenthesis |
Open | Used to match Close characters. Example: same as under Close |
Link_Paired_Opener is a string property of characters, which for each character in \p{Link_Termination=Close}, returns a character with \p{Link_Termination=Open}.
Example
The specification of the characters with each of these property values is given in Property Assignments.
The termination algorithm assumes that a domain (or other host) has been successfully parsed to the start of a Path, Query, or Fragment, as per the algorithm in WhatWG URL: 3. Hosts (domains and IP addresses).
This algorithm then processes each final part [path, query, fragment] of the URL in turn. It stops when it encounters a code point that meets one of the terminating conditions and reports the last location in the current part that is still safely considered part of the link. The common terminating conditions are based on the Link_Termination and Link_Paired_Opener properties:
- A Link_Termination=Hard character, such as a space. Within a Path, “?” and “#” are handled as Hard. Within a Query, “#” is handled as Hard.
- A Link_Termination=Soft character, such as a ?, that is followed by a sequence of zero or more Soft characters, then either a Hard character or the end of the text.
- A Link_Termination=Close character, such as a ], that does not have a matching Open character in the same part of the URL. The matching process uses the Link_Paired_Opener property to determine the correct Open character, and matches against the top element of a stack of Open characters.
More formally:
The termination algorithm begins after the Host (and optionally Port) have been parsed, so there is potentially a Path, Query, or Fragment. In the algorithm below, each of those Parts has an initiator character and zero to two hard terminator characters.
Part | initiator | terminators |
---|---|---|
path | '/' | [?#] |
query | '?' | [#] |
fragment | '#' | [] |
Note: cp[i] refers to the ith code point in the string being parsed, cp[start] is the first code point being considered, and n is the length of the string.
For ease of understanding, this algorithm does not include all features of URL parsing, such as ensuring that every % character is followed by two ASCII hex digits.
The algorithm can be optimized in various ways, of course, as long as the results are the same.
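The terminating conditions and the part table above can be sketched in Python as follows. This is an illustrative reading of the prose, not the normative algorithm: the names `lterm` and `link_end` are hypothetical, the `SOFT` and `OPENER` tables are small hand-copied subsets of the draft property data, and Hard is approximated through the general categories Z* and C* reported by the local `unicodedata` version. The sketch also assumes that a “?” or “#” that ends one part starts the next one, and that an Open character is itself kept in the link.

```python
import unicodedata

# Partial subsets of the draft property data (assumption: enough for a demo).
SOFT = set("!\"',.:;?")
OPENER = {")": "(", "]": "[", "}": "{", ">": "<"}  # Close -> Link_Paired_Opener
# Part initiator -> hard terminators, as in the part table.
TERMINATORS = {"/": "?#", "?": "#", "#": ""}


def lterm(ch):
    """Approximate Link_Termination for one character."""
    if ch in OPENER:
        return "Close"
    if ch in OPENER.values():
        return "Open"
    if ch in SOFT:
        return "Soft"
    # Draft Hard set: whitespace, noncharacters, \p{C} (categories Z* and C*).
    if unicodedata.category(ch)[0] in "CZ":
        return "Hard"
    return "Include"


def link_end(text, start):
    """Exclusive end index of the link whose first part starts at `start`."""
    if start >= len(text) or text[start] not in TERMINATORS:
        return start
    terminators = ""
    open_stack = []
    last_safe = start
    for i in range(start, len(text)):
        ch = text[i]
        if i == start or ch in terminators:
            # A new part begins: switch terminators, drop unmatched openers.
            terminators = TERMINATORS[ch]
            open_stack = []
        kind = lterm(ch)
        if kind == "Hard":
            break
        if kind == "Soft":
            continue  # kept only if an includable character follows
        if kind == "Close":
            if not open_stack or open_stack[-1] != OPENER[ch]:
                break  # unmatched closer ends the link
            open_stack.pop()
        elif kind == "Open":
            open_stack.append(ch)  # assumption: the opener stays in the link
        last_safe = i + 1
    return last_safe


text = "See https://example.com/wiki/Foo_(bar). Next"
start = text.index("/", text.index("//") + 2)
print(text[:link_end(text, start)])
```

In the usage example, the matched parentheses stay inside the link, while the full stop before the space is left outside because it is Soft and followed by a Hard character.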
The draft property assignments are derived according to the following descriptions. Most characters that cause link termination would still be valid, but require % encoding.
Whitespace, non-characters, format, controls, private-use, surrogates, unassigned,...
Review Notes:
Termination characters and quotation marks:
The contents of the second bullet are expanded in the following table:
Char. | Code Point | Name |
---|---|---|
" | U+0022 | QUOTATION MARK |
' | U+0027 | APOSTROPHE |
« | U+00AB | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
» | U+00BB | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
‘ | U+2018 | LEFT SINGLE QUOTATION MARK |
’ | U+2019 | RIGHT SINGLE QUOTATION MARK |
‚ | U+201A | SINGLE LOW-9 QUOTATION MARK |
‛ | U+201B | SINGLE HIGH-REVERSED-9 QUOTATION MARK |
“ | U+201C | LEFT DOUBLE QUOTATION MARK |
” | U+201D | RIGHT DOUBLE QUOTATION MARK |
„ | U+201E | DOUBLE LOW-9 QUOTATION MARK |
‟ | U+201F | DOUBLE HIGH-REVERSED-9 QUOTATION MARK |
‹ | U+2039 | SINGLE LEFT-POINTING ANGLE QUOTATION MARK |
› | U+203A | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK |
Derived from Link_Paired_Opener property
All other code points
if BidiPairedBracketType(cp) == Close then Link_Paired_Opener(cp) = BidiPairedBracket(cp)
else if cp == ">" then Link_Paired_Opener(cp) = "<"
else Link_Paired_Opener(cp) = \x{0}
See Bidi_Paired_Bracket.
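The derivation rule above can be sketched in Python. The standard `unicodedata` module does not expose Bidi_Paired_Bracket, so this sketch assumes the input is text laid out like the UCD file BidiBrackets.txt (code point; paired bracket; type `o` or `c`). The embedded `SAMPLE` is a short hand-copied excerpt in that layout, and the function names are hypothetical.

```python
# A few lines in the layout of BidiBrackets.txt (assumed input format).
SAMPLE = """\
0028; 0029; o # LEFT PARENTHESIS
0029; 0028; c # RIGHT PARENTHESIS
005B; 005D; o # LEFT SQUARE BRACKET
005D; 005B; c # RIGHT SQUARE BRACKET
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
"""


def build_paired_opener(bidi_brackets_text):
    """Map each closing bracket to its opener, plus '>' to '<'."""
    table = {}
    for line in bidi_brackets_text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        cp, paired, bracket_type = (f.strip() for f in data.split(";"))
        if bracket_type == "c":
            table[chr(int(cp, 16))] = chr(int(paired, 16))
    table[">"] = "<"  # the one addition beyond Bidi_Paired_Bracket
    return table


def link_paired_opener(ch, table):
    """Return the opener for `ch`, or U+0000 when it has none."""
    return table.get(ch, "\x00")


table = build_paired_opener(SAMPLE)
print(repr(link_paired_opener(")", table)), repr(link_paired_opener("a", table)))
```

Reading the pairs from data rather than computing them matters for cases such as U+298E, whose opener is U+298F rather than the preceding code point.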
The goal is to be able to generate a serialized form of a URL that:
The minimal escaping algorithm is parallel to the linkification algorithm. Basically, when serializing a URL, a character in a Path, Query, or Fragment is only percent-escaped if it is: Hard, Close when unmatched, or Soft when it is the last code point in the part.
In the following:
The algorithm can be optimized in various ways, of course, as long as the results are the same. For example, the interior escaping for syntactic characters can be combined into a single pass.
Additional characters can be escaped to reduce confusability, especially when they are confusable with URL syntax characters, such as a Ɂ character in a path. See Security Considerations below.
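A minimal sketch of the escaping rule for a single part follows, in Python. It is an interpretation rather than the normative algorithm: the Soft condition is read as applying to the last code point of the part, the property tables are small hand-copied subsets, Hard is approximated by general categories Z* and C*, and the names `lterm`, `pct`, and `escape_part` are hypothetical. A literal “%” is also escaped, which is ordinary URL serialization rather than something stated in this section.

```python
import unicodedata

# Partial subsets of the draft property data (assumption: enough for a demo).
SOFT = set("!\"',.:;?")
OPENER = {")": "(", "]": "[", "}": "{", ">": "<"}  # Close -> Link_Paired_Opener
# Part initiator -> characters that are syntax inside that part.
TERMINATORS = {"/": "?#", "?": "#", "#": ""}


def lterm(ch):
    """Approximate Link_Termination for one character."""
    if ch in OPENER:
        return "Close"
    if ch in OPENER.values():
        return "Open"
    if ch in SOFT:
        return "Soft"
    if unicodedata.category(ch)[0] in "CZ":
        return "Hard"
    return "Include"


def pct(ch):
    """Percent-escape every UTF-8 byte of one character."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def escape_part(content, initiator):
    """Serialize one part (Path, Query, or Fragment) with minimal escaping."""
    syntax = TERMINATORS[initiator]
    out, open_stack = [], []
    for ch in content:
        kind = lterm(ch)
        if ch in syntax or ch == "%" or kind == "Hard":
            out.append(pct(ch))
        elif kind == "Close":
            if open_stack and open_stack[-1] == OPENER[ch]:
                open_stack.pop()
                out.append(ch)  # matched closer stays readable
            else:
                out.append(pct(ch))  # unmatched closer would end the link
        else:
            if kind == "Open":
                open_stack.append(ch)
            out.append(ch)
    # A Soft character at the very end would be dropped by link detection.
    if content and lterm(content[-1]) == "Soft":
        out[-1] = pct(content[-1])
    return initiator + "".join(out)


print(escape_part("wiki/Foo_(bar)", "/"))
print(escape_part("a b)", "/"))
print(escape_part("q=what?", "?"))
```

Non-ASCII letters fall under Include in this sketch and are therefore left unescaped, which is the point of the minimal form.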
The security considerations for Path, Query, and Fragment are far less important than for Domain names. See UTS #39: Unicode Security for more information about domain names. The Format characters (\p{Cf}) are categorized as Link_Termination=Hard because they are zero-width and typically invisible. To ensure that users are aware of them, they need to be escaped (and thus visible) to be included in linkification.
Review Note: However, some of the Format characters may be used sufficiently frequently in text, and in sufficiently well-defined contexts, that they should instead be Include, so that they don't require % escaping in plain text. For example, we could allow in linkification:
- tag_spec + U+E007F CANCEL TAG as per UTS #51: Unicode Emoji, C.1 Flag Emoji Tag Sequences.
There are documented cases of how Format characters can be used to sneak malicious instructions into LLMs; see Invisible text that AI chatbots understand and humans can’t? URLs are just a small part of the larger problem of feeding clean text to LLMs, both in building them and in querying them: making sure the text does not have malformed encodings, is in a consistent Unicode Normalization Form (NFC), and so on.
For security implications of URLs in general, see UTS #39: Unicode Security Mechanisms. For related issues, see UTS #55 Unicode Source Code Handling. For display of BIDI URLs, see also HL4 in UAX #9, Unicode Bidirectional Algorithm.
The following lists the draft assignment of Link_Termination and Link_Paired_Opener property values. Although these are embedded inline at this point, in the release version they would be in a separate file.
# Link_Termination=Include # (All code points without other values) # Link_Termination=Hard # draft = [\p{whitespace}\p{NChar}\p{C}] # (not listing Unassigned or Surrogates) 0000..0020; Hard # (Cc) <control-0000>..(Zs) SPACE 007F..00A0; Hard # (Cc) <control-007F>..(Zs) NO-BREAK SPACE 00AD; Hard # (Cf) SOFT HYPHEN 0600..0605; Hard # (Cf) ARABIC NUMBER SIGN..(Cf) ARABIC NUMBER MARK ABOVE 061C; Hard # (Cf) ARABIC LETTER MARK 06DD; Hard # (Cf) ARABIC END OF AYAH 070F; Hard # (Cf) SYRIAC ABBREVIATION MARK 0890..0891; Hard # (Cf) ARABIC POUND MARK ABOVE..(Cf) ARABIC PIASTRE MARK ABOVE 08E2; Hard # (Cf) ARABIC DISPUTED END OF AYAH 1680; Hard # (Zs) OGHAM SPACE MARK 180E; Hard # (Cf) MONGOLIAN VOWEL SEPARATOR 2000..200F; Hard # (Zs) EN QUAD..(Cf) RIGHT-TO-LEFT MARK 2028..202F; Hard # (Zl) LINE SEPARATOR..(Zs) NARROW NO-BREAK SPACE 205F..2064; Hard # (Zs) MEDIUM MATHEMATICAL SPACE..(Cf) INVISIBLE PLUS 2066..206F; Hard # (Cf) LEFT-TO-RIGHT ISOLATE..(Cf) NOMINAL DIGIT SHAPES 3000; Hard # (Zs) IDEOGRAPHIC SPACE E000..F8FF; Hard # (Co) <private use area-E000>..(Co) <private use area-F8FF> FEFF; Hard # (Cf) ZERO WIDTH NO-BREAK SPACE FFF9..FFFB; Hard # (Cf) INTERLINEAR ANNOTATION ANCHOR..(Cf) INTERLINEAR ANNOTATION TERMINATOR 110BD; Hard # (Cf) KAITHI NUMBER SIGN 110CD; Hard # (Cf) KAITHI NUMBER SIGN ABOVE 13430..1343F; Hard # (Cf) EGYPTIAN HIEROGLYPH VERTICAL JOINER..(Cf) EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE 1BCA0..1BCA3; Hard # (Cf) SHORTHAND FORMAT LETTER OVERLAP..(Cf) SHORTHAND FORMAT UP STEP 1D173..1D17A; Hard # (Cf) MUSICAL SYMBOL BEGIN BEAM..(Cf) MUSICAL SYMBOL END PHRASE E0001; Hard # (Cf) LANGUAGE TAG E0020..E007F; Hard # (Cf) TAG SPACE..(Cf) CANCEL TAG F0000..FFFFD; Hard # (Co) <private use area-F0000>..(Co) <private use area-FFFFD> 100000..10FFFD; Hard # (Co) <private use area-100000>..(Co) <private use area-10FFFD> # Link_Termination=Soft # draft = [\p{Term}["'\u00AB\u00BB\u2018-\u201F\u2039\u203A]] 0021..0022; Soft # (Po) EXCLAMATION MARK..(Po) QUOTATION MARK 
0027; Soft # (Po) APOSTROPHE 002C; Soft # (Po) COMMA 002E; Soft # (Po) FULL STOP 003A..003B; Soft # (Po) COLON..(Po) SEMICOLON 003F; Soft # (Po) QUESTION MARK 00AB; Soft # (Pi) LEFT-POINTING DOUBLE ANGLE QUOTATION MARK 00BB; Soft # (Pf) RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK 037E; Soft # (Po) GREEK QUESTION MARK 0387; Soft # (Po) GREEK ANO TELEIA 0589; Soft # (Po) ARMENIAN FULL STOP 05C3; Soft # (Po) HEBREW PUNCTUATION SOF PASUQ 060C; Soft # (Po) ARABIC COMMA 061B; Soft # (Po) ARABIC SEMICOLON 061D..061F; Soft # (Po) ARABIC END OF TEXT MARK..(Po) ARABIC QUESTION MARK 06D4; Soft # (Po) ARABIC FULL STOP 0700..070A; Soft # (Po) SYRIAC END OF PARAGRAPH..(Po) SYRIAC CONTRACTION 070C; Soft # (Po) SYRIAC HARKLEAN METOBELUS 07F8..07F9; Soft # (Po) NKO COMMA..(Po) NKO EXCLAMATION MARK 0830..0835; Soft # (Po) SAMARITAN PUNCTUATION NEQUDAA..(Po) SAMARITAN PUNCTUATION SHIYYAALAA 0837..083E; Soft # (Po) SAMARITAN PUNCTUATION MELODIC QITSA..(Po) SAMARITAN PUNCTUATION ANNAAU 085E; Soft # (Po) MANDAIC PUNCTUATION 0964..0965; Soft # (Po) DEVANAGARI DANDA..(Po) DEVANAGARI DOUBLE DANDA 0E5A..0E5B; Soft # (Po) THAI CHARACTER ANGKHANKHU..(Po) THAI CHARACTER KHOMUT 0F08; Soft # (Po) TIBETAN MARK SBRUL SHAD 0F0D..0F12; Soft # (Po) TIBETAN MARK SHAD..(Po) TIBETAN MARK RGYA GRAM SHAD 104A..104B; Soft # (Po) MYANMAR SIGN LITTLE SECTION..(Po) MYANMAR SIGN SECTION 1361..1368; Soft # (Po) ETHIOPIC WORDSPACE..(Po) ETHIOPIC PARAGRAPH SEPARATOR 166E; Soft # (Po) CANADIAN SYLLABICS FULL STOP 16EB..16ED; Soft # (Po) RUNIC SINGLE PUNCTUATION..(Po) RUNIC CROSS PUNCTUATION 1735..1736; Soft # (Po) PHILIPPINE SINGLE PUNCTUATION..(Po) PHILIPPINE DOUBLE PUNCTUATION 17D4..17D6; Soft # (Po) KHMER SIGN KHAN..(Po) KHMER SIGN CAMNUC PII KUUH 17DA; Soft # (Po) KHMER SIGN KOOMUUT 1802..1805; Soft # (Po) MONGOLIAN COMMA..(Po) MONGOLIAN FOUR DOTS 1808..1809; Soft # (Po) MONGOLIAN MANCHU COMMA..(Po) MONGOLIAN MANCHU FULL STOP 1944..1945; Soft # (Po) LIMBU EXCLAMATION MARK..(Po) LIMBU QUESTION MARK 1AA8..1AAB; 
Soft # (Po) TAI THAM SIGN KAAN..(Po) TAI THAM SIGN SATKAANKUU 1B4E..1B4F; Soft # (Po) BALINESE INVERTED CARIK SIKI..(Po) BALINESE INVERTED CARIK PAREREN 1B5A..1B5B; Soft # (Po) BALINESE PANTI..(Po) BALINESE PAMADA 1B5D..1B5F; Soft # (Po) BALINESE CARIK PAMUNGKAH..(Po) BALINESE CARIK PAREREN 1B7D..1B7F; Soft # (Po) BALINESE PANTI LANTANG..(Po) BALINESE PANTI BAWAK 1C3B..1C3F; Soft # (Po) LEPCHA PUNCTUATION TA-ROL..(Po) LEPCHA PUNCTUATION TSHOOK 1C7E..1C7F; Soft # (Po) OL CHIKI PUNCTUATION MUCAAD..(Po) OL CHIKI PUNCTUATION DOUBLE MUCAAD 2018..201F; Soft # (Pi) LEFT SINGLE QUOTATION MARK..(Pi) DOUBLE HIGH-REVERSED-9 QUOTATION MARK 2024; Soft # (Po) ONE DOT LEADER 2039..203A; Soft # (Pi) SINGLE LEFT-POINTING ANGLE QUOTATION MARK..(Pf) SINGLE RIGHT-POINTING ANGLE QUOTATION MARK 203C..203D; Soft # (Po) DOUBLE EXCLAMATION MARK..(Po) INTERROBANG 2047..2049; Soft # (Po) DOUBLE QUESTION MARK..(Po) EXCLAMATION QUESTION MARK 2CF9..2CFB; Soft # (Po) COPTIC OLD NUBIAN FULL STOP..(Po) COPTIC OLD NUBIAN INDIRECT QUESTION MARK 2E2E; Soft # (Po) REVERSED QUESTION MARK 2E3C; Soft # (Po) STENOGRAPHIC FULL STOP 2E41; Soft # (Po) REVERSED COMMA 2E4C; Soft # (Po) MEDIEVAL COMMA 2E4E..2E4F; Soft # (Po) PUNCTUS ELEVATUS MARK..(Po) CORNISH VERSE DIVIDER 2E53..2E54; Soft # (Po) MEDIEVAL EXCLAMATION MARK..(Po) MEDIEVAL QUESTION MARK 3001..3002; Soft # (Po) IDEOGRAPHIC COMMA..(Po) IDEOGRAPHIC FULL STOP A4FE..A4FF; Soft # (Po) LISU PUNCTUATION COMMA..(Po) LISU PUNCTUATION FULL STOP A60D..A60F; Soft # (Po) VAI COMMA..(Po) VAI QUESTION MARK A6F3..A6F7; Soft # (Po) BAMUM FULL STOP..(Po) BAMUM QUESTION MARK A876..A877; Soft # (Po) PHAGS-PA MARK SHAD..(Po) PHAGS-PA MARK DOUBLE SHAD A8CE..A8CF; Soft # (Po) SAURASHTRA DANDA..(Po) SAURASHTRA DOUBLE DANDA A92F; Soft # (Po) KAYAH LI SIGN SHYA A9C7..A9C9; Soft # (Po) JAVANESE PADA PANGKAT..(Po) JAVANESE PADA LUNGSI AA5D..AA5F; Soft # (Po) CHAM PUNCTUATION DANDA..(Po) CHAM PUNCTUATION TRIPLE DANDA AADF; Soft # (Po) TAI VIET SYMBOL KOI KOI AAF0..AAF1; Soft 
# (Po) MEETEI MAYEK CHEIKHAN..(Po) MEETEI MAYEK AHANG KHUDAM ABEB; Soft # (Po) MEETEI MAYEK CHEIKHEI FE12; Soft # (Po) PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP FE15..FE16; Soft # (Po) PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK..(Po) PRESENTATION FORM FOR VERTICAL QUESTION MARK FE50..FE52; Soft # (Po) SMALL COMMA..(Po) SMALL FULL STOP FE54..FE57; Soft # (Po) SMALL SEMICOLON..(Po) SMALL EXCLAMATION MARK FF01; Soft # (Po) FULLWIDTH EXCLAMATION MARK FF0C; Soft # (Po) FULLWIDTH COMMA FF0E; Soft # (Po) FULLWIDTH FULL STOP FF1A..FF1B; Soft # (Po) FULLWIDTH COLON..(Po) FULLWIDTH SEMICOLON FF1F; Soft # (Po) FULLWIDTH QUESTION MARK FF61; Soft # (Po) HALFWIDTH IDEOGRAPHIC FULL STOP FF64; Soft # (Po) HALFWIDTH IDEOGRAPHIC COMMA 1039F; Soft # (Po) UGARITIC WORD DIVIDER 103D0; Soft # (Po) OLD PERSIAN WORD DIVIDER 10857; Soft # (Po) IMPERIAL ARAMAIC SECTION SIGN 1091F; Soft # (Po) PHOENICIAN WORD SEPARATOR 10A56..10A57; Soft # (Po) KHAROSHTHI PUNCTUATION DANDA..(Po) KHAROSHTHI PUNCTUATION DOUBLE DANDA 10AF0..10AF5; Soft # (Po) MANICHAEAN PUNCTUATION STAR..(Po) MANICHAEAN PUNCTUATION TWO DOTS 10B3A..10B3F; Soft # (Po) TINY TWO DOTS OVER ONE DOT PUNCTUATION..(Po) LARGE ONE RING OVER TWO RINGS PUNCTUATION 10B99..10B9C; Soft # (Po) PSALTER PAHLAVI SECTION MARK..(Po) PSALTER PAHLAVI FOUR DOTS WITH DOT 10F55..10F59; Soft # (Po) SOGDIAN PUNCTUATION TWO VERTICAL BARS..(Po) SOGDIAN PUNCTUATION HALF CIRCLE WITH DOT 10F86..10F89; Soft # (Po) OLD UYGHUR PUNCTUATION BAR..(Po) OLD UYGHUR PUNCTUATION FOUR DOTS 11047..1104D; Soft # (Po) BRAHMI DANDA..(Po) BRAHMI PUNCTUATION LOTUS 110BE..110C1; Soft # (Po) KAITHI SECTION MARK..(Po) KAITHI DOUBLE DANDA 11141..11143; Soft # (Po) CHAKMA DANDA..(Po) CHAKMA QUESTION MARK 111C5..111C6; Soft # (Po) SHARADA DANDA..(Po) SHARADA DOUBLE DANDA 111CD; Soft # (Po) SHARADA SUTRA MARK 111DE..111DF; Soft # (Po) SHARADA SECTION MARK-1..(Po) SHARADA SECTION MARK-2 11238..1123C; Soft # (Po) KHOJKI DANDA..(Po) KHOJKI DOUBLE SECTION MARK 112A9; Soft # 
(Po) MULTANI SECTION MARK 113D4..113D5; Soft # (Po) TULU-TIGALARI DANDA..(Po) TULU-TIGALARI DOUBLE DANDA 1144B..1144D; Soft # (Po) NEWA DANDA..(Po) NEWA COMMA 1145A..1145B; Soft # (Po) NEWA DOUBLE COMMA..(Po) NEWA PLACEHOLDER MARK 115C2..115C5; Soft # (Po) SIDDHAM DANDA..(Po) SIDDHAM SEPARATOR BAR 115C9..115D7; Soft # (Po) SIDDHAM END OF TEXT MARK..(Po) SIDDHAM SECTION MARK WITH CIRCLES AND FOUR ENCLOSURES 11641..11642; Soft # (Po) MODI DANDA..(Po) MODI DOUBLE DANDA 1173C..1173E; Soft # (Po) AHOM SIGN SMALL SECTION..(Po) AHOM SIGN RULAI 11944; Soft # (Po) DIVES AKURU DOUBLE DANDA 11946; Soft # (Po) DIVES AKURU END OF TEXT MARK 11A42..11A43; Soft # (Po) ZANABAZAR SQUARE MARK SHAD..(Po) ZANABAZAR SQUARE MARK DOUBLE SHAD 11A9B..11A9C; Soft # (Po) SOYOMBO MARK SHAD..(Po) SOYOMBO MARK DOUBLE SHAD 11AA1..11AA2; Soft # (Po) SOYOMBO TERMINAL MARK-1..(Po) SOYOMBO TERMINAL MARK-2 11C41..11C43; Soft # (Po) BHAIKSUKI DANDA..(Po) BHAIKSUKI WORD SEPARATOR 11C71; Soft # (Po) MARCHEN MARK SHAD 11EF7..11EF8; Soft # (Po) MAKASAR PASSIMBANG..(Po) MAKASAR END OF SECTION 11F43..11F44; Soft # (Po) KAWI DANDA..(Po) KAWI DOUBLE DANDA 12470..12474; Soft # (Po) CUNEIFORM PUNCTUATION SIGN OLD ASSYRIAN WORD DIVIDER..(Po) CUNEIFORM PUNCTUATION SIGN DIAGONAL QUADCOLON 16A6E..16A6F; Soft # (Po) MRO DANDA..(Po) MRO DOUBLE DANDA 16AF5; Soft # (Po) BASSA VAH FULL STOP 16B37..16B39; Soft # (Po) PAHAWH HMONG SIGN VOS THOM..(Po) PAHAWH HMONG SIGN CIM CHEEM 16B44; Soft # (Po) PAHAWH HMONG SIGN XAUS 16D6E..16D6F; Soft # (Po) KIRAT RAI DANDA..(Po) KIRAT RAI DOUBLE DANDA 16E97..16E98; Soft # (Po) MEDEFAIDRIN COMMA..(Po) MEDEFAIDRIN FULL STOP 1BC9F; Soft # (Po) DUPLOYAN PUNCTUATION CHINOOK FULL STOP 1DA87..1DA8A; Soft # (Po) SIGNWRITING COMMA..(Po) SIGNWRITING COLON # Link_Termination=Close # draft = [\p{Bidi_Paired_Bracket_Type=Close}[>]] 0029; Close # (Pe) RIGHT PARENTHESIS 003E; Close # (Sm) GREATER-THAN SIGN 005D; Close # (Pe) RIGHT SQUARE BRACKET 007D; Close # (Pe) RIGHT CURLY BRACKET 0F3B; Close # 
(Pe) TIBETAN MARK GUG RTAGS GYAS 0F3D; Close # (Pe) TIBETAN MARK ANG KHANG GYAS 169C; Close # (Pe) OGHAM REVERSED FEATHER MARK 2046; Close # (Pe) RIGHT SQUARE BRACKET WITH QUILL 207E; Close # (Pe) SUPERSCRIPT RIGHT PARENTHESIS 208E; Close # (Pe) SUBSCRIPT RIGHT PARENTHESIS 2309; Close # (Pe) RIGHT CEILING 230B; Close # (Pe) RIGHT FLOOR 232A; Close # (Pe) RIGHT-POINTING ANGLE BRACKET 2769; Close # (Pe) MEDIUM RIGHT PARENTHESIS ORNAMENT 276B; Close # (Pe) MEDIUM FLATTENED RIGHT PARENTHESIS ORNAMENT 276D; Close # (Pe) MEDIUM RIGHT-POINTING ANGLE BRACKET ORNAMENT 276F; Close # (Pe) HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT 2771; Close # (Pe) HEAVY RIGHT-POINTING ANGLE BRACKET ORNAMENT 2773; Close # (Pe) LIGHT RIGHT TORTOISE SHELL BRACKET ORNAMENT 2775; Close # (Pe) MEDIUM RIGHT CURLY BRACKET ORNAMENT 27C6; Close # (Pe) RIGHT S-SHAPED BAG DELIMITER 27E7; Close # (Pe) MATHEMATICAL RIGHT WHITE SQUARE BRACKET 27E9; Close # (Pe) MATHEMATICAL RIGHT ANGLE BRACKET 27EB; Close # (Pe) MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET 27ED; Close # (Pe) MATHEMATICAL RIGHT WHITE TORTOISE SHELL BRACKET 27EF; Close # (Pe) MATHEMATICAL RIGHT FLATTENED PARENTHESIS 2984; Close # (Pe) RIGHT WHITE CURLY BRACKET 2986; Close # (Pe) RIGHT WHITE PARENTHESIS 2988; Close # (Pe) Z NOTATION RIGHT IMAGE BRACKET 298A; Close # (Pe) Z NOTATION RIGHT BINDING BRACKET 298C; Close # (Pe) RIGHT SQUARE BRACKET WITH UNDERBAR 298E; Close # (Pe) RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER 2990; Close # (Pe) RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER 2992; Close # (Pe) RIGHT ANGLE BRACKET WITH DOT 2994; Close # (Pe) RIGHT ARC GREATER-THAN BRACKET 2996; Close # (Pe) DOUBLE RIGHT ARC LESS-THAN BRACKET 2998; Close # (Pe) RIGHT BLACK TORTOISE SHELL BRACKET 29D9; Close # (Pe) RIGHT WIGGLY FENCE 29DB; Close # (Pe) RIGHT DOUBLE WIGGLY FENCE 29FD; Close # (Pe) RIGHT-POINTING CURVED ANGLE BRACKET 2E23; Close # (Pe) TOP RIGHT HALF BRACKET 2E25; Close # (Pe) BOTTOM RIGHT HALF BRACKET 2E27; Close # (Pe) RIGHT 
SIDEWAYS U BRACKET 2E29; Close # (Pe) RIGHT DOUBLE PARENTHESIS 2E56; Close # (Pe) RIGHT SQUARE BRACKET WITH STROKE 2E58; Close # (Pe) RIGHT SQUARE BRACKET WITH DOUBLE STROKE 2E5A; Close # (Pe) TOP HALF RIGHT PARENTHESIS 2E5C; Close # (Pe) BOTTOM HALF RIGHT PARENTHESIS 3009; Close # (Pe) RIGHT ANGLE BRACKET 300B; Close # (Pe) RIGHT DOUBLE ANGLE BRACKET 300D; Close # (Pe) RIGHT CORNER BRACKET 300F; Close # (Pe) RIGHT WHITE CORNER BRACKET 3011; Close # (Pe) RIGHT BLACK LENTICULAR BRACKET 3015; Close # (Pe) RIGHT TORTOISE SHELL BRACKET 3017; Close # (Pe) RIGHT WHITE LENTICULAR BRACKET 3019; Close # (Pe) RIGHT WHITE TORTOISE SHELL BRACKET 301B; Close # (Pe) RIGHT WHITE SQUARE BRACKET FE5A; Close # (Pe) SMALL RIGHT PARENTHESIS FE5C; Close # (Pe) SMALL RIGHT CURLY BRACKET FE5E; Close # (Pe) SMALL RIGHT TORTOISE SHELL BRACKET FF09; Close # (Pe) FULLWIDTH RIGHT PARENTHESIS FF3D; Close # (Pe) FULLWIDTH RIGHT SQUARE BRACKET FF5D; Close # (Pe) FULLWIDTH RIGHT CURLY BRACKET FF60; Close # (Pe) FULLWIDTH RIGHT WHITE PARENTHESIS FF63; Close # (Pe) HALFWIDTH RIGHT CORNER BRACKET # Link_Termination=Open # draft = [\p{Bidi_Paired_Bracket_Type=Open}[<]] 0028; Open # (Ps) LEFT PARENTHESIS 003C; Open # (Sm) LESS-THAN SIGN 005B; Open # (Ps) LEFT SQUARE BRACKET 007B; Open # (Ps) LEFT CURLY BRACKET 0F3A; Open # (Ps) TIBETAN MARK GUG RTAGS GYON 0F3C; Open # (Ps) TIBETAN MARK ANG KHANG GYON 169B; Open # (Ps) OGHAM FEATHER MARK 2045; Open # (Ps) LEFT SQUARE BRACKET WITH QUILL 207D; Open # (Ps) SUPERSCRIPT LEFT PARENTHESIS 208D; Open # (Ps) SUBSCRIPT LEFT PARENTHESIS 2308; Open # (Ps) LEFT CEILING 230A; Open # (Ps) LEFT FLOOR 2329; Open # (Ps) LEFT-POINTING ANGLE BRACKET 2768; Open # (Ps) MEDIUM LEFT PARENTHESIS ORNAMENT 276A; Open # (Ps) MEDIUM FLATTENED LEFT PARENTHESIS ORNAMENT 276C; Open # (Ps) MEDIUM LEFT-POINTING ANGLE BRACKET ORNAMENT 276E; Open # (Ps) HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT 2770; Open # (Ps) HEAVY LEFT-POINTING ANGLE BRACKET ORNAMENT 2772; Open # (Ps) LIGHT 
LEFT TORTOISE SHELL BRACKET ORNAMENT
2774; Open # (Ps) MEDIUM LEFT CURLY BRACKET ORNAMENT
27C5; Open # (Ps) LEFT S-SHAPED BAG DELIMITER
27E6; Open # (Ps) MATHEMATICAL LEFT WHITE SQUARE BRACKET
27E8; Open # (Ps) MATHEMATICAL LEFT ANGLE BRACKET
27EA; Open # (Ps) MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
27EC; Open # (Ps) MATHEMATICAL LEFT WHITE TORTOISE SHELL BRACKET
27EE; Open # (Ps) MATHEMATICAL LEFT FLATTENED PARENTHESIS
2983; Open # (Ps) LEFT WHITE CURLY BRACKET
2985; Open # (Ps) LEFT WHITE PARENTHESIS
2987; Open # (Ps) Z NOTATION LEFT IMAGE BRACKET
2989; Open # (Ps) Z NOTATION LEFT BINDING BRACKET
298B; Open # (Ps) LEFT SQUARE BRACKET WITH UNDERBAR
298D; Open # (Ps) LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298F; Open # (Ps) LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2991; Open # (Ps) LEFT ANGLE BRACKET WITH DOT
2993; Open # (Ps) LEFT ARC LESS-THAN BRACKET
2995; Open # (Ps) DOUBLE LEFT ARC GREATER-THAN BRACKET
2997; Open # (Ps) LEFT BLACK TORTOISE SHELL BRACKET
29D8; Open # (Ps) LEFT WIGGLY FENCE
29DA; Open # (Ps) LEFT DOUBLE WIGGLY FENCE
29FC; Open # (Ps) LEFT-POINTING CURVED ANGLE BRACKET
2E22; Open # (Ps) TOP LEFT HALF BRACKET
2E24; Open # (Ps) BOTTOM LEFT HALF BRACKET
2E26; Open # (Ps) LEFT SIDEWAYS U BRACKET
2E28; Open # (Ps) LEFT DOUBLE PARENTHESIS
2E55; Open # (Ps) LEFT SQUARE BRACKET WITH STROKE
2E57; Open # (Ps) LEFT SQUARE BRACKET WITH DOUBLE STROKE
2E59; Open # (Ps) TOP HALF LEFT PARENTHESIS
2E5B; Open # (Ps) BOTTOM HALF LEFT PARENTHESIS
3008; Open # (Ps) LEFT ANGLE BRACKET
300A; Open # (Ps) LEFT DOUBLE ANGLE BRACKET
300C; Open # (Ps) LEFT CORNER BRACKET
300E; Open # (Ps) LEFT WHITE CORNER BRACKET
3010; Open # (Ps) LEFT BLACK LENTICULAR BRACKET
3014; Open # (Ps) LEFT TORTOISE SHELL BRACKET
3016; Open # (Ps) LEFT WHITE LENTICULAR BRACKET
3018; Open # (Ps) LEFT WHITE TORTOISE SHELL BRACKET
301A; Open # (Ps) LEFT WHITE SQUARE BRACKET
FE59; Open # (Ps) SMALL LEFT PARENTHESIS
FE5B; Open # (Ps) SMALL LEFT CURLY BRACKET
FE5D; Open # (Ps) SMALL LEFT TORTOISE SHELL BRACKET
FF08; Open # (Ps) FULLWIDTH LEFT PARENTHESIS
FF3B; Open # (Ps) FULLWIDTH LEFT SQUARE BRACKET
FF5B; Open # (Ps) FULLWIDTH LEFT CURLY BRACKET
FF5F; Open # (Ps) FULLWIDTH LEFT WHITE PARENTHESIS
FF62; Open # (Ps) HALFWIDTH LEFT CORNER BRACKET

# Link_Paired_Opener
# draft = BidiPairedBracket + (“>” GREATER-THAN SIGN 🡆 “<” LESS-THAN SIGN)
0029; 0028 # “)” RIGHT PARENTHESIS 🡆 “(” LEFT PARENTHESIS
003E; 003C # “>” GREATER-THAN SIGN 🡆 “<” LESS-THAN SIGN
005D; 005B # “]” RIGHT SQUARE BRACKET 🡆 “[” LEFT SQUARE BRACKET
007D; 007B # “}” RIGHT CURLY BRACKET 🡆 “{” LEFT CURLY BRACKET
0F3B; 0F3A # “༻” TIBETAN MARK GUG RTAGS GYAS 🡆 “༺” TIBETAN MARK GUG RTAGS GYON
0F3D; 0F3C # “༽” TIBETAN MARK ANG KHANG GYAS 🡆 “༼” TIBETAN MARK ANG KHANG GYON
169C; 169B # “᚜” OGHAM REVERSED FEATHER MARK 🡆 “᚛” OGHAM FEATHER MARK
2046; 2045 # “⁆” RIGHT SQUARE BRACKET WITH QUILL 🡆 “⁅” LEFT SQUARE BRACKET WITH QUILL
207E; 207D # “⁾” SUPERSCRIPT RIGHT PARENTHESIS 🡆 “⁽” SUPERSCRIPT LEFT PARENTHESIS
208E; 208D # “₎” SUBSCRIPT RIGHT PARENTHESIS 🡆 “₍” SUBSCRIPT LEFT PARENTHESIS
2309; 2308 # “⌉” RIGHT CEILING 🡆 “⌈” LEFT CEILING
230B; 230A # “⌋” RIGHT FLOOR 🡆 “⌊” LEFT FLOOR
232A; 2329 # “〉” RIGHT-POINTING ANGLE BRACKET 🡆 “〈” LEFT-POINTING ANGLE BRACKET
2769; 2768 # “❩” MEDIUM RIGHT PARENTHESIS ORNAMENT 🡆 “❨” MEDIUM LEFT PARENTHESIS ORNAMENT
276B; 276A # “❫” MEDIUM FLATTENED RIGHT PARENTHESIS ORNAMENT 🡆 “❪” MEDIUM FLATTENED LEFT PARENTHESIS ORNAMENT
276D; 276C # “❭” MEDIUM RIGHT-POINTING ANGLE BRACKET ORNAMENT 🡆 “❬” MEDIUM LEFT-POINTING ANGLE BRACKET ORNAMENT
276F; 276E # “❯” HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT 🡆 “❮” HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
2771; 2770 # “❱” HEAVY RIGHT-POINTING ANGLE BRACKET ORNAMENT 🡆 “❰” HEAVY LEFT-POINTING ANGLE BRACKET ORNAMENT
2773; 2772 # “❳” LIGHT RIGHT TORTOISE SHELL BRACKET ORNAMENT 🡆 “❲” LIGHT LEFT TORTOISE SHELL BRACKET ORNAMENT
2775; 2774 # “❵” MEDIUM RIGHT CURLY BRACKET ORNAMENT 🡆 “❴” MEDIUM LEFT CURLY BRACKET ORNAMENT
27C6; 27C5 # “⟆” RIGHT S-SHAPED BAG DELIMITER 🡆 “⟅” LEFT S-SHAPED BAG DELIMITER
27E7; 27E6 # “⟧” MATHEMATICAL RIGHT WHITE SQUARE BRACKET 🡆 “⟦” MATHEMATICAL LEFT WHITE SQUARE BRACKET
27E9; 27E8 # “⟩” MATHEMATICAL RIGHT ANGLE BRACKET 🡆 “⟨” MATHEMATICAL LEFT ANGLE BRACKET
27EB; 27EA # “⟫” MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET 🡆 “⟪” MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
27ED; 27EC # “⟭” MATHEMATICAL RIGHT WHITE TORTOISE SHELL BRACKET 🡆 “⟬” MATHEMATICAL LEFT WHITE TORTOISE SHELL BRACKET
27EF; 27EE # “⟯” MATHEMATICAL RIGHT FLATTENED PARENTHESIS 🡆 “⟮” MATHEMATICAL LEFT FLATTENED PARENTHESIS
2984; 2983 # “⦄” RIGHT WHITE CURLY BRACKET 🡆 “⦃” LEFT WHITE CURLY BRACKET
2986; 2985 # “⦆” RIGHT WHITE PARENTHESIS 🡆 “⦅” LEFT WHITE PARENTHESIS
2988; 2987 # “⦈” Z NOTATION RIGHT IMAGE BRACKET 🡆 “⦇” Z NOTATION LEFT IMAGE BRACKET
298A; 2989 # “⦊” Z NOTATION RIGHT BINDING BRACKET 🡆 “⦉” Z NOTATION LEFT BINDING BRACKET
298C; 298B # “⦌” RIGHT SQUARE BRACKET WITH UNDERBAR 🡆 “⦋” LEFT SQUARE BRACKET WITH UNDERBAR
298E; 298F # “⦎” RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER 🡆 “⦏” LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D # “⦐” RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER 🡆 “⦍” LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
2992; 2991 # “⦒” RIGHT ANGLE BRACKET WITH DOT 🡆 “⦑” LEFT ANGLE BRACKET WITH DOT
2994; 2993 # “⦔” RIGHT ARC GREATER-THAN BRACKET 🡆 “⦓” LEFT ARC LESS-THAN BRACKET
2996; 2995 # “⦖” DOUBLE RIGHT ARC LESS-THAN BRACKET 🡆 “⦕” DOUBLE LEFT ARC GREATER-THAN BRACKET
2998; 2997 # “⦘” RIGHT BLACK TORTOISE SHELL BRACKET 🡆 “⦗” LEFT BLACK TORTOISE SHELL BRACKET
29D9; 29D8 # “⧙” RIGHT WIGGLY FENCE 🡆 “⧘” LEFT WIGGLY FENCE
29DB; 29DA # “⧛” RIGHT DOUBLE WIGGLY FENCE 🡆 “⧚” LEFT DOUBLE WIGGLY FENCE
29FD; 29FC # “⧽” RIGHT-POINTING CURVED ANGLE BRACKET 🡆 “⧼” LEFT-POINTING CURVED ANGLE BRACKET
2E23; 2E22 # “⸣” TOP RIGHT HALF BRACKET 🡆 “⸢” TOP LEFT HALF BRACKET
2E25; 2E24 # “⸥” BOTTOM RIGHT HALF BRACKET 🡆 “⸤” BOTTOM LEFT HALF BRACKET
2E27; 2E26 # “⸧” RIGHT SIDEWAYS U BRACKET 🡆 “⸦” LEFT SIDEWAYS U BRACKET
2E29; 2E28 # “⸩” RIGHT DOUBLE PARENTHESIS 🡆 “⸨” LEFT DOUBLE PARENTHESIS
2E56; 2E55 # “⹖” RIGHT SQUARE BRACKET WITH STROKE 🡆 “⹕” LEFT SQUARE BRACKET WITH STROKE
2E58; 2E57 # “⹘” RIGHT SQUARE BRACKET WITH DOUBLE STROKE 🡆 “⹗” LEFT SQUARE BRACKET WITH DOUBLE STROKE
2E5A; 2E59 # “⹚” TOP HALF RIGHT PARENTHESIS 🡆 “⹙” TOP HALF LEFT PARENTHESIS
2E5C; 2E5B # “⹜” BOTTOM HALF RIGHT PARENTHESIS 🡆 “⹛” BOTTOM HALF LEFT PARENTHESIS
3009; 3008 # “〉” RIGHT ANGLE BRACKET 🡆 “〈” LEFT ANGLE BRACKET
300B; 300A # “》” RIGHT DOUBLE ANGLE BRACKET 🡆 “《” LEFT DOUBLE ANGLE BRACKET
300D; 300C # “」” RIGHT CORNER BRACKET 🡆 “「” LEFT CORNER BRACKET
300F; 300E # “』” RIGHT WHITE CORNER BRACKET 🡆 “『” LEFT WHITE CORNER BRACKET
3011; 3010 # “】” RIGHT BLACK LENTICULAR BRACKET 🡆 “【” LEFT BLACK LENTICULAR BRACKET
3015; 3014 # “〕” RIGHT TORTOISE SHELL BRACKET 🡆 “〔” LEFT TORTOISE SHELL BRACKET
3017; 3016 # “〗” RIGHT WHITE LENTICULAR BRACKET 🡆 “〖” LEFT WHITE LENTICULAR BRACKET
3019; 3018 # “〙” RIGHT WHITE TORTOISE SHELL BRACKET 🡆 “〘” LEFT WHITE TORTOISE SHELL BRACKET
301B; 301A # “〛” RIGHT WHITE SQUARE BRACKET 🡆 “〚” LEFT WHITE SQUARE BRACKET
FE5A; FE59 # “﹚” SMALL RIGHT PARENTHESIS 🡆 “﹙” SMALL LEFT PARENTHESIS
FE5C; FE5B # “﹜” SMALL RIGHT CURLY BRACKET 🡆 “﹛” SMALL LEFT CURLY BRACKET
FE5E; FE5D # “﹞” SMALL RIGHT TORTOISE SHELL BRACKET 🡆 “﹝” SMALL LEFT TORTOISE SHELL BRACKET
FF09; FF08 # “）” FULLWIDTH RIGHT PARENTHESIS 🡆 “（” FULLWIDTH LEFT PARENTHESIS
FF3D; FF3B # “］” FULLWIDTH RIGHT SQUARE BRACKET 🡆 “［” FULLWIDTH LEFT SQUARE BRACKET
FF5D; FF5B # “｝” FULLWIDTH RIGHT CURLY BRACKET 🡆 “｛” FULLWIDTH LEFT CURLY BRACKET
FF60; FF5F # “｠” FULLWIDTH RIGHT WHITE PARENTHESIS 🡆 “｟” FULLWIDTH LEFT WHITE PARENTHESIS
FF63; FF62 # “｣” HALFWIDTH RIGHT CORNER BRACKET 🡆 “｢” HALFWIDTH LEFT CORNER BRACKET
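As an illustration of how the Link_Paired_Opener listing above might be consumed, the following Python sketch reads lines of the form "closer; opener # comment" into a lookup table. The function name `parse_paired_openers` and the three-line sample are assumptions made for this example only. They are not part of this specification, and the comments in the sample are abbreviated.

```python
def parse_paired_openers(lines):
    """Map each closing character to its paired opening character."""
    pairs = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # skip blank and comment-only lines
        closer_hex, opener_hex = (field.strip() for field in data.split(";"))
        pairs[chr(int(closer_hex, 16))] = chr(int(opener_hex, 16))
    return pairs


# A small sample in the same layout as the listing above.
SAMPLE = """\
# Link_Paired_Opener
0029; 0028 # RIGHT PARENTHESIS, LEFT PARENTHESIS
003E; 003C # GREATER-THAN SIGN, LESS-THAN SIGN
300D; 300C # RIGHT CORNER BRACKET, LEFT CORNER BRACKET
""".splitlines()

PAIRED_OPENER = parse_paired_openers(SAMPLE)
print(PAIRED_OPENER[")"])
```

Keying the table by the closing character matches the direction of the data, where the first field is the closer and the second is its opener.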
For comparison to the related General_Category values, see the characters in:
TBD: The plan is to have two types of test lines, something like the following.
@Linkification
# Field 0: Source
# Field 1: Expected Linkification, where: ⸠ is at the start, and ⸡ is at the end
See example.com! on…; See ⸠example.com⸡! on…
See example.com/αβγ on…; See ⸠example.com/αβγ⸡ on…
See example.com?αβγ on…; See ⸠example.com?αβγ⸡ on…
See example.com#αβγ on…; See ⸠example.com#αβγ⸡ on…
See example.com/αβγ/δεζ?θικ#λμν on…; See ⸠example.com/αβγ/δεζ?θικ#λμν⸡ on…
See example.com/αβγ/δεζ?δ.εφ#λμν on…; See ⸠example.com/αβγ/δεζ?δ.εφ#λμν⸡ on…
See example.com/αβγ/δεζ?δ εφ#λμν on…; See ⸠example.com/αβγ/δεζ?δ⸡ εφ#λμν on… # Break on hard (' ')
See example.com/αβγ/δεζ?δ. εφ#λμν on…; See ⸠example.com/αβγ/δεζ?δ⸡. εφ#λμν on… # Break on soft ('.') followed by hard (' ')
See example.com/α/βγ?δ/ε?ζ#λ/μ?ν#π on…; See ⸠example.com/α/βγ?δ/ε?ζ#λ/μ?ν#π⸡ on…
See example.com/αβ) on…; See ⸠example.com/αβ⸡) on… # Break on unmatched bracket
See example.com/α(β) on…; See ⸠example.com/α(β)⸡ on… # Include matched bracket
See example.com/αβ(γ/δ)ρς?θικ#λμν on…; See ⸠example.com/αβ(γ/δ)ρς?θικ#λμν⸡ on… # Includes matching across interior syntax (consider changing)

@Minimal-Escaping
# Field 0: Scheme and host
# Field 1: Path
# Field 2: Query
# Field 3: Fragment
# Field 4: Expected result
https://example.com; α; ; ; https://example.com/α # Path only
https://example.com; ; α; ; https://example.com?α # Query only
https://example.com; ; ; α; https://example.com#α # Fragment only
https://example.com; αβγ/δεζ; θ=ικλ&μ=γξο; πρς; https://example.com/αβγ/δεζ?θ=ικλ&μ=γξο#πρς # All parts
https://example.com; α?μπ; ; ; https://example.com/α%3Fμπ # Escape ? in Path
https://example.com; α#β; γ=δ#ε; ; https://example.com/α%23β?γ=δ%23ε # Escape # in Path/Query
https://example.com; αβ γ/δεζ; θ=ικ λ&=γξο; πρ σ; https://example.com/αβ%20γ/δεζ?θ=ικ%20λ&=γξο#πρ%20σ # Escape hard (' ')
https://example.com; αβγ./δεζ.; θ=ικ.λ&=γξο.; πρς.; https://example.com/αβγ./δεζ.?θ=ικ.λ&=γξο.#πρς%2E # Escape soft ('.') unless followed by include
https://example.com; α(β)); γ(δ)); ε(ζ)); https://example.com/α(β)%29?γ(δ)%29#ε(ζ)%29 # Escape unmatched brackets
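Because the test format is still marked TBD, the following Python sketch is only one possible way a test harness might read an @Linkification line. It assumes that any trailing comment has already been removed, that the two fields are separated by the first semicolon, and that each line carries exactly one ⸠…⸡ pair. The function name `parse_linkification_line` is hypothetical.

```python
START, END = "⸠", "⸡"  # markers used in Field 1 of the draft test lines


def parse_linkification_line(line):
    """Return (source, start, end); the expected link is source[start:end]."""
    source, expected = (field.strip() for field in line.split(";", 1))
    start = expected.index(START)
    # The end marker sits one position later than in the marker-free text,
    # because the start marker precedes it.
    end = expected.index(END) - len(START)
    if expected.replace(START, "").replace(END, "") != source:
        raise ValueError("Field 1 does not match Field 0 once markers are removed")
    return source, start, end


source, start, end = parse_linkification_line(
    "See example.com/α(β) on…; See ⸠example.com/α(β)⸡ on…"
)
print(source[start:end])
```

A link detector under test would then be run on `source`, and its reported boundaries compared with `start` and `end`.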
For scripts that don’t need spaces between words, it is a bit tricky to linkify within sentences. For example, take:
The URL is set off from the rest of the text. But then look at it in the equivalent Japanese sentence:
That would not maintain a separation between the text if simply substituted for x in a phrase like “xは重要なページです”, so the linkification would go too far. One would need some kind of separator character to separate the text. That can be done with Hard characters (e.g., space):
Or with Close characters, such as:
One could consider modifying the algorithm to provide for a termination between non-spacing scripts and spacing scripts. That wouldn’t help with the above examples, but would help with cases like:
However, that would complicate the behavior for little overall benefit.
One might consider adding quotation marks to Open/Close, but that would make the algorithm much more complicated. The problem is that the items are not uniquely Close or Open and the pairings are not 1:1 in natural languages. So these characters are categorized as Soft. Examples:
Open(s) | Close |
---|---|
" | " |
' | ' |
„ | “ |
‚ | ‘ |
‟ “ ” „ | ” |
‛ ‘ ’ ‚ | ’ |
‹ | › |
› | ‹ |
« | » |
» | « |
There is a further complication, that some quotation marks appear in non-paired usage, such as RIGHT SINGLE QUOTATION MARK or APOSTROPHE, but also QUOTATION MARK as an alternative to HEBREW PUNCTUATION GERSHAYIM. The simplest and most predictable solution is to have them be Soft.
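The effect of treating quotation marks as Soft can be illustrated with a deliberately simplified Python sketch. It models only Hard and Soft characters and ignores Open and Close handling, so it is not the full algorithm of this specification. The `SOFT` and `HARD` sets are small illustrative subsets chosen for this example, not the actual property data, and `link_end` is a hypothetical name.

```python
# Illustrative subsets only; the real assignments come from the property data.
SOFT = set(".\"'“”‘’„‚‟‛«»‹›")
HARD = set(" \t\n")


def link_end(text, start):
    """Return the end index of a link starting at `start`.

    Scanning stops at the first Hard character. Soft characters are kept
    only when a non-Soft character follows them inside the link.
    """
    end = start
    for i in range(start, len(text)):
        ch = text[i]
        if ch in HARD:
            break
        if ch not in SOFT:
            end = i + 1  # everything up to here is safely inside the link
    return end


text = "She wrote “see example.com/αβγ”. Then she left."
start = text.index("example")
print(text[start:link_end(text, start)])
```

In this sketch the closing quotation mark and the period are dropped because only a space follows them, while the period inside "example.com" is kept because further link characters follow it.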
The < and > characters are added to Link_Paired_Opener because they are used to set off URLs, as in <https://eel.is/c++draft/vector.bool.pspc#lib:vector<bool>> and <https://wg21.link/p2348>. While many sources that formerly recommended that practice no longer do (such as the Chicago Manual of Style), others have continued the practice, such as C++ SG16.
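The pairing of < and > can be sketched with a small stack-based scan in Python. This is a simplified illustration that handles only whitespace and paired brackets and ignores Soft characters. The `PAIRED_OPENER` table is a four-entry subset of the Link_Paired_Opener data, and `scan_link` is a hypothetical name.

```python
# Illustrative subset of Link_Paired_Opener, keyed closer -> opener.
PAIRED_OPENER = {")": "(", ">": "<", "]": "[", "}": "{"}
OPENERS = set(PAIRED_OPENER.values())


def scan_link(text, start):
    """Return the end index of a link at `start`, stopping at whitespace
    or at a closing bracket that has no matching opener inside the link."""
    stack = []
    end = start
    for i in range(start, len(text)):
        ch = text[i]
        if ch.isspace():
            break
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRED_OPENER:
            if not stack or stack[-1] != PAIRED_OPENER[ch]:
                break  # unmatched closer: it belongs to the surrounding text
            stack.pop()
        end = i + 1
    return end


text = "See <https://eel.is/c++draft/vector.bool.pspc#lib:vector<bool>> and more"
start = text.index("https")
print(text[start:scan_link(text, start)])
```

The inner "<bool>" stays in the link because its opener was seen inside the URL, while the final ">" is excluded because its opener precedes the URL.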
TBD
TBD
The following summarizes modifications from the previous revision of this document.
Post working-draft L2/24-217, based on discussion during the UTC #181 meeting.
Modifications for previous versions are listed in those respective versions.
© 2024–2024 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.
Use of all Unicode Products, including this publication, is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.