Proposed Draft Unicode® Technical Standard #58

Unicode Linkification

Version	1.0
Editors	Mark Davis
Date	2024-11-13
This Version	https://www.unicode.org/reports/tr58/tr58-1.html
Previous Version	none
Latest Version	https://www.unicode.org/reports/tr58/
Latest Proposed Update	https://www.unicode.org/reports/tr58/proposed.html
Revision	1

Summary

This document specifies a standard mechanism for detecting URLs embedded in plain text — in particular, detecting URLs containing non-ASCII characters. It also defines the minimally necessary escaping of non-ASCII code points in the Path, Query, and Fragment portions of a URL that aligns with the mechanism for detecting URLs.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For more information see About Unicode Technical Reports and the Specifications FAQ. Unicode Technical Reports are governed by the Unicode Terms of Use.

1 Introduction
2 Conformance
3 Link Detection Algorithm
4 Minimal Escaping
5 Security Considerations
6 Property Data
7 Test Data
Review Issues
References
Acknowledgments
Modifications

1 Introduction

With most email programs, when someone pastes in the plain text:

The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン contains information about Albert Einstein.

and sends it to someone else, the recipient receives it as:

The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン contains information about Albert Einstein.

URLs are also “linkified” in many other applications, such as when pasting into a word processor (triggered by typing a space afterwards, for example). However, many products (many text messaging apps, video messaging chats, etc.) completely fail to recognize any non-ASCII characters past the domain name. And even among those that do recognize such non-ASCII characters, there are gratuitous differences in where they stop linkifying.

Linkification is the process of adding links to URLs in plain text input, such as in emails, text messaging, or video meeting chats. The first step in this process is link detection, which is determining the boundaries of spans of text that contain URLs. That substring can then have a link applied to it in output text. The functions that perform these operations are called a linkifier and link detector, respectively.

The specifications for a URL don’t specify how to handle link detection, since they are only concerned with the structure in isolation, not when it is embedded within flowing text. The lack of a clear specification for link detection also causes many implementations to overuse percent escaping for non-ASCII characters when converting URLs into plain text.

Notes

Following WhatWG URL: Goals, this specification uses the term URL broadly, as including unescaped non-ASCII characters; that is, as utilizing the formal definition of IRIs. See also the W3C's An Introduction to Multilingual Web Addresses.

In examples, links will be shown with a background color, to make the extent of the linkification clear.

The linkification process for URLs is already fragmented, with different implementations producing very different results, and that fragmentation is amplified by the addition of non-ASCII characters, which often have very different behavior. That is, developers’ lack of familiarity with the behavior of non-ASCII characters has caused the different implementations of linkification to splinter. Yet non-ASCII characters are very important for readability. People do not want to see the above URL expressed in escaped ASCII:

The page https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%AB%E3%83%99%E3%83%AB%E3%83%88%29%E3%82%A2%E3%82%A4%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3 contains information about Albert Einstein.

For example, take the lists of links on List of articles every Wikipedia should have in the available languages. When those are tested with major products, there are significant differences: any two implementations are likely to linkify those differently, such as terminating the linkification at different places, or not linkifying at all. That makes it very difficult to exchange URLs between products within plaintext, which is done surprisingly often — definitely causing problems for implementations that need predictable behavior.

This inconsistency causes problems for users and software companies. Having consistent rules for linkification also has additional benefits, leading to solutions for the following reported problems:

If a system allows users to have their own user ids that end up in URLs, like https://www.linkedin.com/in/my.user.name, it can avoid user ids that have problematic linkification behavior, like trailing periods after path segments.
Because linkification cannot be predicted for URLs with non-ASCII characters, common practice is to exchange them with escaped characters, which gives unreadable results such as the long line above.

There are many use cases for reducing the % encoding in URLs. For example, consider the common practice of providing user handles such as:

x.com/jaketapper
bsky.app/profile/jaketapper.bsky.social
www.instagram.com/vancityreynolds/
www.youtube.com/@핑크퐁

The first three of these work well in practice: copying from the address bar and pasting into text provides a readable result. However, doing the same with a non-ASCII handle (that is, for the majority of the world's population) results in an unreadable string such as https://www.youtube.com/@%E0%A6%AC%E0%A6%B0%E0%A6%BF%E0%A6%B6%E0%A6%BE%E0%A6%87%E0%A6%B2%E0%A7%8D%E0%A6%B2%E0%A6%BE%E0%A6%B9_%E0%A6%AE%E0%A6%A8%E0%A7%81.

If linkification behavior becomes more predictable across platforms and applications, applications will be able to do minimal escaping. For example, in the following only one character would need escaping, the %29 — representing an unmatched “)”:

https://ja.wikipedia.org/wiki/アルベルト%29アインシュタイン

Providing a consistent, predictable solution that works well across the world’s languages requires a standardized algorithm to define the behavior, and the corresponding Unicode character properties covering all Unicode characters.

Review Note: This draft has not been copy-edited; that is done in later drafts. The Table of Contents will be fleshed out at that point also.

2 Conformance

UTS58-C1. For a given version of Unicode, a conformant implementation shall replicate the same link detection results as those produced by Section 3, Link Detection Algorithm.

UTS58-C2. For a given version of Unicode, a conformant implementation shall replicate the same minimal escaping results as those produced by Section 4, Minimal Escaping.

3 Link Detection Algorithm

The following table shows the relevant parts of a URL. For clarity, the separator characters are included in the examples. For more information see WhatWG's URL: Example URL Components.

Parts of a URL

Protocol	Host (incl. Domain)	Port	Path	Query	Fragment
https://	docs.foobar.com	:8000	/knowledge/area/	?name=article&topic=seo	#top

Note that the Protocol, Port, Path, Query, and Fragment are each optional.
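As a rough illustration of the table above, the example URL can be split with Python's standard library. This is only a sketch: urllib.parse follows RFC 3986 conventions rather than the WhatWG URL parser, and it drops the separator characters that the table keeps for clarity.

```python
from urllib.parse import urlsplit

# The example URL from the "Parts of a URL" table, reassembled.
parts = urlsplit(
    "https://docs.foobar.com:8000/knowledge/area/?name=article&topic=seo#top"
)

print(parts.scheme)    # https
print(parts.hostname)  # docs.foobar.com
print(parts.port)      # 8000
print(parts.path)      # /knowledge/area/
print(parts.query)     # name=article&topic=seo
print(parts.fragment)  # top
```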

Processes

There are two main processes involved in Unicode link detection.

Initiation. This requires determining the point within plaintext where the parsing of a URL starts. When the scheme is present for a URL (such as “http://”), determining the start of link detection is simple. However, the scheme for a URL is commonly omitted when URLs are represented in text. For example, the string “adobe.com” should be recognized as being a URL when it occurs in the body of an email message, even though it does not have a scheme.
Termination. This requires determining the point within plaintext where the parsing of a URL ends. A formal reading of the URL specs allows almost any character in certain fields, so it is insufficient for separating the end of the URL from the non-URL text after it.

Initiation

The start of a URL is easy to determine when it has a known protocol (e.g., “https://”).

Implementations have also developed heuristics for determining the start of the URL when the protocol is elided, taking advantage of the fact that there are relatively few top-level domains. Those techniques can be easily applied to internationalized domain names, which still have strong limitations on the valid characters, so the end of the domain name is also relatively easy to determine. For more information, see UTS #46, Unicode IDNA Compatibility Processing.

The parsing up to the path, query, or fragment is as specified in WhatWG URL: 4.4. URL parsing.

For example, implementations must terminate link detection if a forbidden host code point is encountered, or if the host is a domain and a forbidden domain code point is encountered. Implementations must not linkify if a domain is not a registrable domain. The terms forbidden host code point, forbidden domain code point, and registrable domain are defined in WhatWG URL: Host representation.

For example, an implementation would parse to the end of microsoft.com, google.de, foo.рф, or xn--j1ay.xn--p1ai.

Termination

Termination is much more challenging, because of the presence of characters from many different writing systems. While small, hard-coded sets of characters suffice for an ASCII implementation, there are over 150,000 Unicode characters, many with quite different behavior than ASCII. While in theory, almost any Unicode character can occur in certain fields in a URL, in practice many characters have very restricted usage in URLs.

Initiation stops at any Path, Query, or Fragment, so the termination process takes over at a “/”, “?”, or “#” character. Each Path, Query, or Fragment can contain most Unicode characters. The key is to be able to determine, given a Part (such as a Query), when a character or sequence of characters should cause termination of the link detection, even though those characters would be valid in the URL specification.

It is impossible for a link detection algorithm to match user expectations in all circumstances, given the variation in usage of various characters both within and across languages. So the goal is to cover use cases as broadly as possible, recognizing that it will sometimes not match user expectations in certain cases. Exceptional cases (URLs that need to use characters that would terminate) can still be appropriately linkified if those few characters are represented with % escapes.

At a high level, this specification defines three features:

A method for identifying when to terminate link detection based on properties that define contexts for terminating the parsing of a URL.
- This addresses the question, for example, when a trailing period should be counted as part of a link or not.
A method for identifying balanced quotes and brackets that enclose a URL.
- This addresses the distinction, for example, of enclosing the entire URL in parentheses, vs. URLs that contain a part that is enclosed in parens, etc.
An algorithm for doing the above, together with an enumerated property and a mapping.

One of the goals is also predictability; it should be relatively easy for users to understand the link detection behavior at a high level.

Properties

This specification defines two properties: Link_Termination (LTerm) and Link_Paired_Opener (LOpener).

Link_Termination Property

Link_Termination is an enumerated property of characters with five enumerated values: {Include, Hard, Soft, Close, Open}

Value	Description / Examples
Include	There is no stop before the character; it is included in the link.
	Example: letters https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン
Hard	The URL terminates before this character.
	Example: a space Go to https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン to find the material.
Soft	The URL terminates before this character, if it is followed by `/\p{lt=Soft}*(\p{lt=Hard}|$)/`
	Example: a question mark https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン?abc https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン? abc https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン?
Close	If the character is paired with a previous character in the same part (path, query, fragment), it is treated as Include. Otherwise it is treated as Hard. [Review Note: for paths, should this be limited to the same segment between '/' characters?]
	Example: an end parenthesis https://ja.wikipedia.org/wiki/(アルベルト)アインシュタインアインシュタイン) (https://ja.wikipedia.org/wiki/アルベルト)アインシュタイン (https://ja.wikipedia.org/wiki/アルベルトアインシュタイン
Open	Used to match Close characters.
	Example: same as under Close
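The Soft rule in the table can be sketched with a regular expression. This is only an illustration: Python's standard re module has no \p{...} property support, so the character classes below are small hand-picked stand-ins (an assumption), and soft_terminates is a hypothetical helper name.

```python
import re

# Toy stand-ins for the real property values (assumption: ASCII subsets only).
SOFT_CHARS = ".,;:!?'\""

# Zero or more Soft characters, then a Hard character (whitespace here)
# or the end of the text.
_TRAILING = re.compile(r"""[.,;:!?'"]*(?:\s|\Z)""")

def soft_terminates(text: str, i: int) -> bool:
    """Return True if the Soft character at text[i] ends the link."""
    if text[i] not in SOFT_CHARS:
        raise ValueError("text[i] is not a Soft character in this toy table")
    return _TRAILING.match(text, i + 1) is not None
```

With these toy sets, the "?" in "example.com/a?abc" does not terminate the link, while the "?" in "example.com/a? abc" and in "example.com/a?" does.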

Link_Paired_Opener Property

Link_Paired_Opener is a string property of characters, which for each character in \p{Link_Termination=Close}, returns a character with \p{Link_Termination=Open}.

Review Note: Also see the Review Issues.

Example

Link_Paired_Opener('}') == '{'

The specification of the characters with each of these property values is given in Property Assignments.

Termination Algorithm

The termination algorithm assumes that a domain (or other host) has been successfully parsed to the start of a Path, Query, or Fragment, as per the algorithm in WhatWG URL: 3. Hosts (domains and IP addresses).

This algorithm then processes each final part [path, query, fragment] of the URL in turn. It stops when it encounters a code point that meets one of the terminating conditions and reports the last location in the current part that is still safely considered part of the link. The common terminating conditions are based on the Link_Termination and Link_Paired_Opener properties:

A Link_Termination=Hard character, such as a space. Within a Path, “?” and “#” are handled as Hard. Within a Query, “#” is handled as Hard.
A Link_Termination=Soft character, such as a ? that is followed by a sequence of zero or more Soft characters, then either a Hard character or the end of the text.
A Link_Termination=Close character, such as a ] that does not have a matching Open character in the same part of the URL. The matching process uses the Link_Paired_Opener property to determine the correct Open character, and matches against the top element of a stack of Open characters.

More formally:

The termination algorithm begins after the Host (and optionally Port) have been parsed, so there is potentially a Path, Query, or Fragment. In the algorithm below, each of those Parts has an initiator character and zero to two hard terminator characters.

Part	initiator	terminators
path	'/'	[?#]
query	'?'	[#]
fragment	'#'	[]

Note: cp[i] refers to the i^th code point in the string being parsed, cp[0] is the first code point being considered, and n is the length of the string.

Set lastSafe to 0 — this marks the offset after the last code point that is included in the link detection (so far).
Set part to the Part whose initiator == cp[0]. If there is none, stop and return lastSafe.
Clear the openStack.
Loop from i = 0 to n - 1
1. Set LT to Link_Termination(cp[i])
2. If LT == Include
  1. If part.terminators contains cp[i]
    1. Set part to the Part whose initiator == cp[i]
    2. Clear the openStack.
  2. Set lastSafe to be i+1
  3. Continue loop
3. If LT == Soft
  1. Continue loop
4. If LT == Hard
  1. Stop and return lastSafe
5. If LT == Open
  1. Push cp[i] onto openStack
  2. Set lastSafe to be i+1
  3. Continue loop.
6. If LT == Close
  1. If openStack is empty
    1. Stop and return lastSafe
  2. Set lastOpen to the pop of openStack
  3. If Link_Paired_Opener(cp[i]) == lastOpen
    1. Set lastSafe to be i+1
    2. Continue loop.
  4. Else stop and return lastSafe.
After the loop terminates, return lastSafe.
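The steps above can be sketched in Python as follows. This is a minimal sketch, not a conformant implementation. The property tables are tiny illustrative subsets, the names detect_end and link_termination are hypothetical, and Hard is approximated from General_Category. The sketch also makes one assumption beyond the listed steps: the part-terminator check runs before the property lookup, so that a "?" (listed as Soft) can still start a Query.

```python
import unicodedata

# Toy property data (assumption: small subsets, not the draft's full sets).
SOFT = set(".,;:!?'\"。、")
OPENER_OF = {")": "(", "]": "[", "}": "{", ">": "<", "」": "「"}
OPEN = set(OPENER_OF.values())
TERMINATORS = {"/": "?#", "?": "#", "#": ""}  # initiator -> terminators

def link_termination(ch: str) -> str:
    if ch in SOFT:
        return "Soft"
    if ch in OPEN:
        return "Open"
    if ch in OPENER_OF:
        return "Close"
    if unicodedata.category(ch)[0] in "CZ":  # controls, format, spaces, ...
        return "Hard"
    return "Include"

def detect_end(cp: str) -> int:
    """cp starts right after the host/port; returns lastSafe."""
    last_safe = 0
    if not cp or cp[0] not in TERMINATORS:
        return last_safe
    part = cp[0]
    open_stack = []
    for i, ch in enumerate(cp):
        if i > 0 and ch in TERMINATORS[part]:
            part = ch            # move on to the Query or Fragment
            open_stack.clear()   # brackets do not match across parts
        lt = link_termination(ch)
        if lt == "Include":
            last_safe = i + 1
        elif lt == "Soft":
            continue             # included only if something safe follows
        elif lt == "Hard":
            break
        elif lt == "Open":
            open_stack.append(ch)
            last_safe = i + 1
        else:                    # Close
            if not open_stack or open_stack.pop() != OPENER_OF[ch]:
                break
            last_safe = i + 1
    return last_safe
```

For instance, detect_end("/wiki/アルベルト? abc") stops before the "?", while detect_end("/wiki/アルベルト?abc ") includes "?abc".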

For ease of understanding, this algorithm does not include all features of URL parsing, such as ensuring that every % character is followed by two ASCII hex digits.

The algorithm can be optimized in various ways, of course, as long as the results are the same.

Property Assignments

The draft property assignments are derived according to the following descriptions. Most characters that cause link termination would still be valid, but require % encoding.

Review Note: These will be generated in a standard property data file. The following are initial assignments of properties; they should be reviewed to see where they need enhancement. The exact UnicodeSets are provided for experimentation, and a full listing of the draft assignments is supplied in Property Data.

Link_Termination=Hard

Whitespace, non-characters, format, controls, private-use, surrogates, unassigned,...

[\p{whitespace}\p{NChar}\p{C}]
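For experimentation, the set above can be approximated in Python using General_Category alone. This is a sketch under stated assumptions: is_hard is a hypothetical helper, and the result reflects whichever Unicode version the Python runtime bundles. The approximation relies on White_Space characters all being in the Z* categories or Cc, and on noncharacters being reported as Cn.

```python
import unicodedata

def is_hard(ch: str) -> bool:
    # Categories starting with C cover controls, format, private-use,
    # surrogates, and unassigned (including noncharacters); categories
    # starting with Z cover space, line, and paragraph separators.
    return unicodedata.category(ch)[0] in "CZ"
```

Under this approximation, a space, U+00AD SOFT HYPHEN, and U+200D ZERO WIDTH JOINER are all Hard, while letters and ordinary punctuation are not.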

Review Notes:

It is likely that we will want to special-case certain high-frequency format characters, such as ZWJ, ZWNJ, TAGs, and so on — but in very restricted contexts.
The algorithm already disallows matching brackets across the initiator syntax characters for Query and Fragment. We may also want to restrict the matching of brackets across interior syntax characters, that is, across [/] in Path, and any of [?=&] in Query. Examples of how this would limit link detection are example.com/ab(/cd)/de, and example.com/?a=(cd&b=ef)&g=hi.

Link_Termination=Soft

Termination characters (\p{Term}) and quotation marks:

[\p{Term}["'\u00AB\u00BB\u2018-\u201F\u2039\u203A]]

The quotation marks in that set are listed in the following table:

Char.	Code Point	Name
"	`U+0022`	QUOTATION MARK
'	`U+0027`	APOSTROPHE
«	`U+00AB`	LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
»	`U+00BB`	RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
‘	`U+2018`	LEFT SINGLE QUOTATION MARK
’	`U+2019`	RIGHT SINGLE QUOTATION MARK
‚	`U+201A`	SINGLE LOW-9 QUOTATION MARK
‛	`U+201B`	SINGLE HIGH-REVERSED-9 QUOTATION MARK
“	`U+201C`	LEFT DOUBLE QUOTATION MARK
”	`U+201D`	RIGHT DOUBLE QUOTATION MARK
„	`U+201E`	DOUBLE LOW-9 QUOTATION MARK
‟	`U+201F`	DOUBLE HIGH-REVERSED-9 QUOTATION MARK
‹	`U+2039`	SINGLE LEFT-POINTING ANGLE QUOTATION MARK
›	`U+203A`	SINGLE RIGHT-POINTING ANGLE QUOTATION MARK

Link_Termination=Open, Link_Termination=Close

Derived from Link_Paired_Opener property

Link_Termination=Include

All other code points

Link_Paired_Opener

if BidiPairedBracketType(cp) == Close then Link_Paired_Opener(cp) = BidiPairedBracket(cp)

else if cp == ">" then Link_Paired_Opener(cp) = "<"

else Link_Paired_Opener(cp) = \x{0}

See Bidi_Paired_Bracket.
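The derivation above might be sketched as follows. Python's unicodedata module does not expose Bidi_Paired_Bracket, so the table here is a small hand-copied subset (an assumption); a real implementation would read the full mapping from the Unicode Character Database. The function name link_paired_opener is hypothetical.

```python
# Closing bracket -> opening bracket, for a few characters whose
# Bidi_Paired_Bracket_Type is Close (subset only, for illustration).
BIDI_CLOSE_TO_OPEN = {
    ")": "(",
    "]": "[",
    "}": "{",
    "\u300d": "\u300c",  # CJK corner brackets
    "\uff09": "\uff08",  # fullwidth parentheses
}

def link_paired_opener(ch: str) -> str:
    if ch in BIDI_CLOSE_TO_OPEN:
        return BIDI_CLOSE_TO_OPEN[ch]
    if ch == ">":
        return "<"
    return "\x00"  # no paired opener

# Link_Termination=Close and =Open then fall out of the mapping.
CLOSE = set(BIDI_CLOSE_TO_OPEN) | {">"}
OPEN = {link_paired_opener(ch) for ch in CLOSE}
```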

4 Minimal Escaping

The goal is to be able to generate a serialized form of a URL that:

is correctly parsed by modern browsers and other devices
minimizes the use of percent-escapes
is completely link-detected when isolated.
1. For example, “abc.com/path./path2.” would serialize as “abc.com/path./path2%2E” so that linkification will identify all of the serialized form within plaintext such as “See abc.com/path./path2%2E for more information”.
2. If not surrounded by Hard characters, the linkification may extend beyond the bounds of the serialized form. For example, “See Xabc.com/path./path2%2EX for more information”.

Notes:

The minimal escaping algorithm is parallel to the linkification algorithm. Basically, when serializing a URL, a character in a Path, Query, or Fragment is only percent-escaped if it is: Hard, Close when unmatched, or Soft when it is the last code point in the last part.

In the following:

cp[i] refers to the i^th code point in the part being serialized, cp[0] is the first code point in the part, and n is the number of code points.
The algorithm assumes that the Path, Query, and Fragment have the normal interior escaping for syntactic characters such as the part.terminators and a “/” within part of a Path.

Minimal Escaping Algorithm

Set output to ""
Process all Parts up to the Path, Query, and Fragment in the normal fashion, successively appending to output
For each part in any non-empty Path, Query, Fragment, successively:
1. Append to output: part.initiator
2. Set copiedAlready = 0
3. Clear the openStack
4. Loop from i = 0 to n - 1
  1. If part.terminators contains cp[i]
    1. Set LT to Hard
  2. Else set LT to Link_Termination(cp[i])
  3. If LT == Include
    1. Append to output: any code points between copiedAlready (inclusive) and i (exclusive)
    2. Append to output: cp[i]
    3. Set copiedAlready to i+1
    4. Continue loop
  4. If LT == Hard
    1. Append to output: any code points between copiedAlready (inclusive) and i (exclusive)
    2. Append to output: percentEscape(cp[i])
    3. Set copiedAlready to i+1
    4. Continue loop
  5. If LT == Soft
    1. Continue loop
  6. If LT == Open
    1. Push cp[i] onto openStack
    2. Do the same as LT == Include
  7. If LT == Close
    1. Set lastOpen to the pop of openStack, or 0 if the openStack is empty
    2. If Link_Paired_Opener(cp[i]) == lastOpen
      1. Do the same as LT == Include
    3. Else do the same as LT == Hard
5. If part is not last
  1. Append to output: any code points between copiedAlready (inclusive) and n (exclusive)
6. Else if copiedAlready < n
  1. Append to output: any code points between copiedAlready (inclusive) and n-1 (exclusive)
  2. Append to output: percentEscape(cp[n-1])
Return output.
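The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the property tables are tiny illustrative subsets, Hard is approximated from General_Category, and escape_parts, percent_escape, and the (initiator, content) input shape are hypothetical choices made for this example. Percent-escaping is done by hand because urllib.parse.quote never escapes ".".

```python
import unicodedata

# Toy property data (assumption: small subsets, not the draft's full sets).
SOFT = set(".,;:!?'\"。、")
OPENER_OF = {")": "(", "]": "[", "}": "{", ">": "<", "」": "「"}
OPEN = set(OPENER_OF.values())
TERMINATORS = {"/": "?#", "?": "#", "#": ""}  # initiator -> terminators

def link_termination(ch: str) -> str:
    if ch in SOFT:
        return "Soft"
    if ch in OPEN:
        return "Open"
    if ch in OPENER_OF:
        return "Close"
    if unicodedata.category(ch)[0] in "CZ":
        return "Hard"
    return "Include"

def percent_escape(ch: str) -> str:
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))

def escape_parts(prefix: str, parts: list[tuple[str, str]]) -> str:
    """prefix is everything before the Path; parts are (initiator, content)."""
    output = prefix
    for index, (initiator, cp) in enumerate(parts):
        output += initiator
        copied = 0
        open_stack = []
        for i, ch in enumerate(cp):
            lt = "Hard" if ch in TERMINATORS[initiator] else link_termination(ch)
            if lt == "Open":
                open_stack.append(ch)
                lt = "Include"
            elif lt == "Close":
                last_open = open_stack.pop() if open_stack else None
                lt = "Include" if last_open == OPENER_OF[ch] else "Hard"
            if lt == "Soft":
                continue  # deferred until a later code point is copied
            output += cp[copied:i]
            output += ch if lt == "Include" else percent_escape(ch)
            copied = i + 1
        if index < len(parts) - 1:
            output += cp[copied:]
        elif copied < len(cp):
            # Trailing Soft run in the last part: escape its final code point.
            output += cp[copied:-1] + percent_escape(cp[-1])
    return output
```

With these toy tables, escape_parts("abc.com", [("/", "path./path2.")]) yields "abc.com/path./path2%2E": the interior period is kept and only the trailing one is escaped.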

The algorithm can be optimized in various ways, of course, as long as the results are the same. For example, the interior escaping for syntactic characters can be combined into a single pass.

Additional characters can be escaped to reduce confusability, especially when they are confusable with URL syntax characters, such as a Ɂ character in a path. See Security Considerations below.

5 Security Considerations

The security considerations for Path, Query, and Fragment are far less important than for Domain names. See UTS #39: Unicode Security for more information about domain names. The Format characters (\p{Cf}) are categorized as Link_Termination=Hard because they are zero-width and typically invisible. To ensure that users are aware of them, they need to be escaped (and thus visible) to be included in linkification.

Review Note: However, some of the Format characters may be used sufficiently frequently in text, and in sufficiently well-defined contexts, that they should instead be Include, so that they don't require % escaping in plain text. For example, we could allow in linkification:

unescaped ZWJ or ZWNJ, but only between the types of characters as specified in UTS #39: Unicode Security, Section 3.1.1.1 Limited Contexts for Joining Controls, or
a sequence of unescaped TAG characters, but only when following a U+1F3F4 BLACK FLAG character and matching tag_spec + U+E007F CANCEL TAG as per UTS #51: Unicode Emoji, C.1 Flag Emoji Tag Sequences.

There are documented cases of how Format characters can be used to sneak malicious instructions into LLMs; see Invisible text that AI chatbots understand and humans can’t? URLs are just a small part of the larger problem of feeding clean text to LLMs, both in building them and in querying them: making sure the text does not have malformed encodings, is in a consistent Unicode Normalization Form (NFC), and so on.

For security implications of URLs in general, see UTS #39: Unicode Security Mechanisms. For related issues, see UTS #55 Unicode Source Code Handling. For display of BIDI URLs, see also HL4 in UAX #9, Unicode Bidirectional Algorithm.

6 Property Data

The following lists the draft assignment of Link_Termination and Link_Paired_Opener property values. Although these are embedded inline at this point, in the release version they would be in a separate file.
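The listing uses a "code point or range; value # comment" layout, so it can be read with a few lines of Python. This is a sketch for experimentation only: parse_property_lines is a hypothetical helper, and the sample lines are copied from the listing that follows.

```python
import re

# A code point or range in hex, a semicolon, then the property value.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_property_lines(lines):
    """Yield (first, last, value); comment and blank lines are skipped."""
    for line in lines:
        m = LINE.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            yield first, last, m.group(3)

sample = [
    "#\tLink_Termination=Hard",
    "0000..0020;     Hard\t# (Cc) <control-0000>..(Zs) SPACE",
    "00AD;           Hard\t# (Cf) SOFT HYPHEN",
    "",
    "003F;           Soft\t# (Po) QUESTION MARK",
]
ranges = list(parse_property_lines(sample))
```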

#	Link_Termination=Include
#   (All code points without other values)

#	Link_Termination=Hard
#   draft = [\p{whitespace}\p{NChar}\p{C}]
#   (not listing Unassigned or Surrogates)

0000..0020;     Hard	# (Cc) <control-0000>..(Zs) SPACE
007F..00A0;     Hard	# (Cc) <control-007F>..(Zs) NO-BREAK SPACE
00AD;           Hard	# (Cf) SOFT HYPHEN
0600..0605;     Hard	# (Cf) ARABIC NUMBER SIGN..(Cf) ARABIC NUMBER MARK ABOVE
061C;           Hard	# (Cf) ARABIC LETTER MARK
06DD;           Hard	# (Cf) ARABIC END OF AYAH
070F;           Hard	# (Cf) SYRIAC ABBREVIATION MARK
0890..0891;     Hard	# (Cf) ARABIC POUND MARK ABOVE..(Cf) ARABIC PIASTRE MARK ABOVE
08E2;           Hard	# (Cf) ARABIC DISPUTED END OF AYAH
1680;           Hard	# (Zs) OGHAM SPACE MARK
180E;           Hard	# (Cf) MONGOLIAN VOWEL SEPARATOR
2000..200F;     Hard	# (Zs) EN QUAD..(Cf) RIGHT-TO-LEFT MARK
2028..202F;     Hard	# (Zl) LINE SEPARATOR..(Zs) NARROW NO-BREAK SPACE
205F..2064;     Hard	# (Zs) MEDIUM MATHEMATICAL SPACE..(Cf) INVISIBLE PLUS
2066..206F;     Hard	# (Cf) LEFT-TO-RIGHT ISOLATE..(Cf) NOMINAL DIGIT SHAPES
3000;           Hard	# (Zs) IDEOGRAPHIC SPACE
E000..F8FF;     Hard	# (Co) <private use area-E000>..(Co) <private use area-F8FF>
FEFF;           Hard	# (Cf) ZERO WIDTH NO-BREAK SPACE
FFF9..FFFB;     Hard	# (Cf) INTERLINEAR ANNOTATION ANCHOR..(Cf) INTERLINEAR ANNOTATION TERMINATOR
110BD;          Hard	# (Cf) KAITHI NUMBER SIGN
110CD;          Hard	# (Cf) KAITHI NUMBER SIGN ABOVE
13430..1343F;   Hard	# (Cf) EGYPTIAN HIEROGLYPH VERTICAL JOINER..(Cf) EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE
1BCA0..1BCA3;   Hard	# (Cf) SHORTHAND FORMAT LETTER OVERLAP..(Cf) SHORTHAND FORMAT UP STEP
1D173..1D17A;   Hard	# (Cf) MUSICAL SYMBOL BEGIN BEAM..(Cf) MUSICAL SYMBOL END PHRASE
E0001;          Hard	# (Cf) LANGUAGE TAG
E0020..E007F;   Hard	# (Cf) TAG SPACE..(Cf) CANCEL TAG
F0000..FFFFD;   Hard	# (Co) <private use area-F0000>..(Co) <private use area-FFFFD>
100000..10FFFD; Hard	# (Co) <private use area-100000>..(Co) <private use area-10FFFD>


#	Link_Termination=Soft
#   draft = [\p{Term}["'\u00AB\u00BB\u2018-\u201F\u2039\u203A]]

0021..0022;     Soft	# (Po) EXCLAMATION MARK..(Po) QUOTATION MARK
0027;           Soft	# (Po) APOSTROPHE
002C;           Soft	# (Po) COMMA
002E;           Soft	# (Po) FULL STOP
003A..003B;     Soft	# (Po) COLON..(Po) SEMICOLON
003F;           Soft	# (Po) QUESTION MARK
00AB;           Soft	# (Pi) LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
00BB;           Soft	# (Pf) RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
037E;           Soft	# (Po) GREEK QUESTION MARK
0387;           Soft	# (Po) GREEK ANO TELEIA
0589;           Soft	# (Po) ARMENIAN FULL STOP
05C3;           Soft	# (Po) HEBREW PUNCTUATION SOF PASUQ
060C;           Soft	# (Po) ARABIC COMMA
061B;           Soft	# (Po) ARABIC SEMICOLON
061D..061F;     Soft	# (Po) ARABIC END OF TEXT MARK..(Po) ARABIC QUESTION MARK
06D4;           Soft	# (Po) ARABIC FULL STOP
0700..070A;     Soft	# (Po) SYRIAC END OF PARAGRAPH..(Po) SYRIAC CONTRACTION
070C;           Soft	# (Po) SYRIAC HARKLEAN METOBELUS
07F8..07F9;     Soft	# (Po) NKO COMMA..(Po) NKO EXCLAMATION MARK
0830..0835;     Soft	# (Po) SAMARITAN PUNCTUATION NEQUDAA..(Po) SAMARITAN PUNCTUATION SHIYYAALAA
0837..083E;     Soft	# (Po) SAMARITAN PUNCTUATION MELODIC QITSA..(Po) SAMARITAN PUNCTUATION ANNAAU
085E;           Soft	# (Po) MANDAIC PUNCTUATION
0964..0965;     Soft	# (Po) DEVANAGARI DANDA..(Po) DEVANAGARI DOUBLE DANDA
0E5A..0E5B;     Soft	# (Po) THAI CHARACTER ANGKHANKHU..(Po) THAI CHARACTER KHOMUT
0F08;           Soft	# (Po) TIBETAN MARK SBRUL SHAD
0F0D..0F12;     Soft	# (Po) TIBETAN MARK SHAD..(Po) TIBETAN MARK RGYA GRAM SHAD
104A..104B;     Soft	# (Po) MYANMAR SIGN LITTLE SECTION..(Po) MYANMAR SIGN SECTION
1361..1368;     Soft	# (Po) ETHIOPIC WORDSPACE..(Po) ETHIOPIC PARAGRAPH SEPARATOR
166E;           Soft	# (Po) CANADIAN SYLLABICS FULL STOP
16EB..16ED;     Soft	# (Po) RUNIC SINGLE PUNCTUATION..(Po) RUNIC CROSS PUNCTUATION
1735..1736;     Soft	# (Po) PHILIPPINE SINGLE PUNCTUATION..(Po) PHILIPPINE DOUBLE PUNCTUATION
17D4..17D6;     Soft	# (Po) KHMER SIGN KHAN..(Po) KHMER SIGN CAMNUC PII KUUH
17DA;           Soft	# (Po) KHMER SIGN KOOMUUT
1802..1805;     Soft	# (Po) MONGOLIAN COMMA..(Po) MONGOLIAN FOUR DOTS
1808..1809;     Soft	# (Po) MONGOLIAN MANCHU COMMA..(Po) MONGOLIAN MANCHU FULL STOP
1944..1945;     Soft	# (Po) LIMBU EXCLAMATION MARK..(Po) LIMBU QUESTION MARK
1AA8..1AAB;     Soft	# (Po) TAI THAM SIGN KAAN..(Po) TAI THAM SIGN SATKAANKUU
1B4E..1B4F;     Soft	# (Po) BALINESE INVERTED CARIK SIKI..(Po) BALINESE INVERTED CARIK PAREREN
1B5A..1B5B;     Soft	# (Po) BALINESE PANTI..(Po) BALINESE PAMADA
1B5D..1B5F;     Soft	# (Po) BALINESE CARIK PAMUNGKAH..(Po) BALINESE CARIK PAREREN
1B7D..1B7F;     Soft	# (Po) BALINESE PANTI LANTANG..(Po) BALINESE PANTI BAWAK
1C3B..1C3F;     Soft	# (Po) LEPCHA PUNCTUATION TA-ROL..(Po) LEPCHA PUNCTUATION TSHOOK
1C7E..1C7F;     Soft	# (Po) OL CHIKI PUNCTUATION MUCAAD..(Po) OL CHIKI PUNCTUATION DOUBLE MUCAAD
2018..201F;     Soft	# (Pi) LEFT SINGLE QUOTATION MARK..(Pi) DOUBLE HIGH-REVERSED-9 QUOTATION MARK
2024;           Soft	# (Po) ONE DOT LEADER
2039..203A;     Soft	# (Pi) SINGLE LEFT-POINTING ANGLE QUOTATION MARK..(Pf) SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203C..203D;     Soft	# (Po) DOUBLE EXCLAMATION MARK..(Po) INTERROBANG
2047..2049;     Soft	# (Po) DOUBLE QUESTION MARK..(Po) EXCLAMATION QUESTION MARK
2CF9..2CFB;     Soft	# (Po) COPTIC OLD NUBIAN FULL STOP..(Po) COPTIC OLD NUBIAN INDIRECT QUESTION MARK
2E2E;           Soft	# (Po) REVERSED QUESTION MARK
2E3C;           Soft	# (Po) STENOGRAPHIC FULL STOP
2E41;           Soft	# (Po) REVERSED COMMA
2E4C;           Soft	# (Po) MEDIEVAL COMMA
2E4E..2E4F;     Soft	# (Po) PUNCTUS ELEVATUS MARK..(Po) CORNISH VERSE DIVIDER
2E53..2E54;     Soft	# (Po) MEDIEVAL EXCLAMATION MARK..(Po) MEDIEVAL QUESTION MARK
3001..3002;     Soft	# (Po) IDEOGRAPHIC COMMA..(Po) IDEOGRAPHIC FULL STOP
A4FE..A4FF;     Soft	# (Po) LISU PUNCTUATION COMMA..(Po) LISU PUNCTUATION FULL STOP
A60D..A60F;     Soft	# (Po) VAI COMMA..(Po) VAI QUESTION MARK
A6F3..A6F7;     Soft	# (Po) BAMUM FULL STOP..(Po) BAMUM QUESTION MARK
A876..A877;     Soft	# (Po) PHAGS-PA MARK SHAD..(Po) PHAGS-PA MARK DOUBLE SHAD
A8CE..A8CF;     Soft	# (Po) SAURASHTRA DANDA..(Po) SAURASHTRA DOUBLE DANDA
A92F;           Soft	# (Po) KAYAH LI SIGN SHYA
A9C7..A9C9;     Soft	# (Po) JAVANESE PADA PANGKAT..(Po) JAVANESE PADA LUNGSI
AA5D..AA5F;     Soft	# (Po) CHAM PUNCTUATION DANDA..(Po) CHAM PUNCTUATION TRIPLE DANDA
AADF;           Soft	# (Po) TAI VIET SYMBOL KOI KOI
AAF0..AAF1;     Soft	# (Po) MEETEI MAYEK CHEIKHAN..(Po) MEETEI MAYEK AHANG KHUDAM
ABEB;           Soft	# (Po) MEETEI MAYEK CHEIKHEI
FE12;           Soft	# (Po) PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
FE15..FE16;     Soft	# (Po) PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK..(Po) PRESENTATION FORM FOR VERTICAL QUESTION MARK
FE50..FE52;     Soft	# (Po) SMALL COMMA..(Po) SMALL FULL STOP
FE54..FE57;     Soft	# (Po) SMALL SEMICOLON..(Po) SMALL EXCLAMATION MARK
FF01;           Soft	# (Po) FULLWIDTH EXCLAMATION MARK
FF0C;           Soft	# (Po) FULLWIDTH COMMA
FF0E;           Soft	# (Po) FULLWIDTH FULL STOP
FF1A..FF1B;     Soft	# (Po) FULLWIDTH COLON..(Po) FULLWIDTH SEMICOLON
FF1F;           Soft	# (Po) FULLWIDTH QUESTION MARK
FF61;           Soft	# (Po) HALFWIDTH IDEOGRAPHIC FULL STOP
FF64;           Soft	# (Po) HALFWIDTH IDEOGRAPHIC COMMA
1039F;          Soft	# (Po) UGARITIC WORD DIVIDER
103D0;          Soft	# (Po) OLD PERSIAN WORD DIVIDER
10857;          Soft	# (Po) IMPERIAL ARAMAIC SECTION SIGN
1091F;          Soft	# (Po) PHOENICIAN WORD SEPARATOR
10A56..10A57;   Soft	# (Po) KHAROSHTHI PUNCTUATION DANDA..(Po) KHAROSHTHI PUNCTUATION DOUBLE DANDA
10AF0..10AF5;   Soft	# (Po) MANICHAEAN PUNCTUATION STAR..(Po) MANICHAEAN PUNCTUATION TWO DOTS
10B3A..10B3F;   Soft	# (Po) TINY TWO DOTS OVER ONE DOT PUNCTUATION..(Po) LARGE ONE RING OVER TWO RINGS PUNCTUATION
10B99..10B9C;   Soft	# (Po) PSALTER PAHLAVI SECTION MARK..(Po) PSALTER PAHLAVI FOUR DOTS WITH DOT
10F55..10F59;   Soft	# (Po) SOGDIAN PUNCTUATION TWO VERTICAL BARS..(Po) SOGDIAN PUNCTUATION HALF CIRCLE WITH DOT
10F86..10F89;   Soft	# (Po) OLD UYGHUR PUNCTUATION BAR..(Po) OLD UYGHUR PUNCTUATION FOUR DOTS
11047..1104D;   Soft	# (Po) BRAHMI DANDA..(Po) BRAHMI PUNCTUATION LOTUS
110BE..110C1;   Soft	# (Po) KAITHI SECTION MARK..(Po) KAITHI DOUBLE DANDA
11141..11143;   Soft	# (Po) CHAKMA DANDA..(Po) CHAKMA QUESTION MARK
111C5..111C6;   Soft	# (Po) SHARADA DANDA..(Po) SHARADA DOUBLE DANDA
111CD;          Soft	# (Po) SHARADA SUTRA MARK
111DE..111DF;   Soft	# (Po) SHARADA SECTION MARK-1..(Po) SHARADA SECTION MARK-2
11238..1123C;   Soft	# (Po) KHOJKI DANDA..(Po) KHOJKI DOUBLE SECTION MARK
112A9;          Soft	# (Po) MULTANI SECTION MARK
113D4..113D5;   Soft	# (Po) TULU-TIGALARI DANDA..(Po) TULU-TIGALARI DOUBLE DANDA
1144B..1144D;   Soft	# (Po) NEWA DANDA..(Po) NEWA COMMA
1145A..1145B;   Soft	# (Po) NEWA DOUBLE COMMA..(Po) NEWA PLACEHOLDER MARK
115C2..115C5;   Soft	# (Po) SIDDHAM DANDA..(Po) SIDDHAM SEPARATOR BAR
115C9..115D7;   Soft	# (Po) SIDDHAM END OF TEXT MARK..(Po) SIDDHAM SECTION MARK WITH CIRCLES AND FOUR ENCLOSURES
11641..11642;   Soft	# (Po) MODI DANDA..(Po) MODI DOUBLE DANDA
1173C..1173E;   Soft	# (Po) AHOM SIGN SMALL SECTION..(Po) AHOM SIGN RULAI
11944;          Soft	# (Po) DIVES AKURU DOUBLE DANDA
11946;          Soft	# (Po) DIVES AKURU END OF TEXT MARK
11A42..11A43;   Soft	# (Po) ZANABAZAR SQUARE MARK SHAD..(Po) ZANABAZAR SQUARE MARK DOUBLE SHAD
11A9B..11A9C;   Soft	# (Po) SOYOMBO MARK SHAD..(Po) SOYOMBO MARK DOUBLE SHAD
11AA1..11AA2;   Soft	# (Po) SOYOMBO TERMINAL MARK-1..(Po) SOYOMBO TERMINAL MARK-2
11C41..11C43;   Soft	# (Po) BHAIKSUKI DANDA..(Po) BHAIKSUKI WORD SEPARATOR
11C71;          Soft	# (Po) MARCHEN MARK SHAD
11EF7..11EF8;   Soft	# (Po) MAKASAR PASSIMBANG..(Po) MAKASAR END OF SECTION
11F43..11F44;   Soft	# (Po) KAWI DANDA..(Po) KAWI DOUBLE DANDA
12470..12474;   Soft	# (Po) CUNEIFORM PUNCTUATION SIGN OLD ASSYRIAN WORD DIVIDER..(Po) CUNEIFORM PUNCTUATION SIGN DIAGONAL QUADCOLON
16A6E..16A6F;   Soft	# (Po) MRO DANDA..(Po) MRO DOUBLE DANDA
16AF5;          Soft	# (Po) BASSA VAH FULL STOP
16B37..16B39;   Soft	# (Po) PAHAWH HMONG SIGN VOS THOM..(Po) PAHAWH HMONG SIGN CIM CHEEM
16B44;          Soft	# (Po) PAHAWH HMONG SIGN XAUS
16D6E..16D6F;   Soft	# (Po) KIRAT RAI DANDA..(Po) KIRAT RAI DOUBLE DANDA
16E97..16E98;   Soft	# (Po) MEDEFAIDRIN COMMA..(Po) MEDEFAIDRIN FULL STOP
1BC9F;          Soft	# (Po) DUPLOYAN PUNCTUATION CHINOOK FULL STOP
1DA87..1DA8A;   Soft	# (Po) SIGNWRITING COMMA..(Po) SIGNWRITING COLON


#	Link_Termination=Close
#   draft = [\p{Bidi_Paired_Bracket_Type=Close}[>]]

0029;           Close	# (Pe) RIGHT PARENTHESIS
003E;           Close	# (Sm) GREATER-THAN SIGN
005D;           Close	# (Pe) RIGHT SQUARE BRACKET
007D;           Close	# (Pe) RIGHT CURLY BRACKET
0F3B;           Close	# (Pe) TIBETAN MARK GUG RTAGS GYAS
0F3D;           Close	# (Pe) TIBETAN MARK ANG KHANG GYAS
169C;           Close	# (Pe) OGHAM REVERSED FEATHER MARK
2046;           Close	# (Pe) RIGHT SQUARE BRACKET WITH QUILL
207E;           Close	# (Pe) SUPERSCRIPT RIGHT PARENTHESIS
208E;           Close	# (Pe) SUBSCRIPT RIGHT PARENTHESIS
2309;           Close	# (Pe) RIGHT CEILING
230B;           Close	# (Pe) RIGHT FLOOR
232A;           Close	# (Pe) RIGHT-POINTING ANGLE BRACKET
2769;           Close	# (Pe) MEDIUM RIGHT PARENTHESIS ORNAMENT
276B;           Close	# (Pe) MEDIUM FLATTENED RIGHT PARENTHESIS ORNAMENT
276D;           Close	# (Pe) MEDIUM RIGHT-POINTING ANGLE BRACKET ORNAMENT
276F;           Close	# (Pe) HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT
2771;           Close	# (Pe) HEAVY RIGHT-POINTING ANGLE BRACKET ORNAMENT
2773;           Close	# (Pe) LIGHT RIGHT TORTOISE SHELL BRACKET ORNAMENT
2775;           Close	# (Pe) MEDIUM RIGHT CURLY BRACKET ORNAMENT
27C6;           Close	# (Pe) RIGHT S-SHAPED BAG DELIMITER
27E7;           Close	# (Pe) MATHEMATICAL RIGHT WHITE SQUARE BRACKET
27E9;           Close	# (Pe) MATHEMATICAL RIGHT ANGLE BRACKET
27EB;           Close	# (Pe) MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET
27ED;           Close	# (Pe) MATHEMATICAL RIGHT WHITE TORTOISE SHELL BRACKET
27EF;           Close	# (Pe) MATHEMATICAL RIGHT FLATTENED PARENTHESIS
2984;           Close	# (Pe) RIGHT WHITE CURLY BRACKET
2986;           Close	# (Pe) RIGHT WHITE PARENTHESIS
2988;           Close	# (Pe) Z NOTATION RIGHT IMAGE BRACKET
298A;           Close	# (Pe) Z NOTATION RIGHT BINDING BRACKET
298C;           Close	# (Pe) RIGHT SQUARE BRACKET WITH UNDERBAR
298E;           Close	# (Pe) RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990;           Close	# (Pe) RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
2992;           Close	# (Pe) RIGHT ANGLE BRACKET WITH DOT
2994;           Close	# (Pe) RIGHT ARC GREATER-THAN BRACKET
2996;           Close	# (Pe) DOUBLE RIGHT ARC LESS-THAN BRACKET
2998;           Close	# (Pe) RIGHT BLACK TORTOISE SHELL BRACKET
29D9;           Close	# (Pe) RIGHT WIGGLY FENCE
29DB;           Close	# (Pe) RIGHT DOUBLE WIGGLY FENCE
29FD;           Close	# (Pe) RIGHT-POINTING CURVED ANGLE BRACKET
2E23;           Close	# (Pe) TOP RIGHT HALF BRACKET
2E25;           Close	# (Pe) BOTTOM RIGHT HALF BRACKET
2E27;           Close	# (Pe) RIGHT SIDEWAYS U BRACKET
2E29;           Close	# (Pe) RIGHT DOUBLE PARENTHESIS
2E56;           Close	# (Pe) RIGHT SQUARE BRACKET WITH STROKE
2E58;           Close	# (Pe) RIGHT SQUARE BRACKET WITH DOUBLE STROKE
2E5A;           Close	# (Pe) TOP HALF RIGHT PARENTHESIS
2E5C;           Close	# (Pe) BOTTOM HALF RIGHT PARENTHESIS
3009;           Close	# (Pe) RIGHT ANGLE BRACKET
300B;           Close	# (Pe) RIGHT DOUBLE ANGLE BRACKET
300D;           Close	# (Pe) RIGHT CORNER BRACKET
300F;           Close	# (Pe) RIGHT WHITE CORNER BRACKET
3011;           Close	# (Pe) RIGHT BLACK LENTICULAR BRACKET
3015;           Close	# (Pe) RIGHT TORTOISE SHELL BRACKET
3017;           Close	# (Pe) RIGHT WHITE LENTICULAR BRACKET
3019;           Close	# (Pe) RIGHT WHITE TORTOISE SHELL BRACKET
301B;           Close	# (Pe) RIGHT WHITE SQUARE BRACKET
FE5A;           Close	# (Pe) SMALL RIGHT PARENTHESIS
FE5C;           Close	# (Pe) SMALL RIGHT CURLY BRACKET
FE5E;           Close	# (Pe) SMALL RIGHT TORTOISE SHELL BRACKET
FF09;           Close	# (Pe) FULLWIDTH RIGHT PARENTHESIS
FF3D;           Close	# (Pe) FULLWIDTH RIGHT SQUARE BRACKET
FF5D;           Close	# (Pe) FULLWIDTH RIGHT CURLY BRACKET
FF60;           Close	# (Pe) FULLWIDTH RIGHT WHITE PARENTHESIS
FF63;           Close	# (Pe) HALFWIDTH RIGHT CORNER BRACKET


#	Link_Termination=Open
#   draft = [\p{Bidi_Paired_Bracket_Type=Open}[<]]

0028;           Open	# (Ps) LEFT PARENTHESIS
003C;           Open	# (Sm) LESS-THAN SIGN
005B;           Open	# (Ps) LEFT SQUARE BRACKET
007B;           Open	# (Ps) LEFT CURLY BRACKET
0F3A;           Open	# (Ps) TIBETAN MARK GUG RTAGS GYON
0F3C;           Open	# (Ps) TIBETAN MARK ANG KHANG GYON
169B;           Open	# (Ps) OGHAM FEATHER MARK
2045;           Open	# (Ps) LEFT SQUARE BRACKET WITH QUILL
207D;           Open	# (Ps) SUPERSCRIPT LEFT PARENTHESIS
208D;           Open	# (Ps) SUBSCRIPT LEFT PARENTHESIS
2308;           Open	# (Ps) LEFT CEILING
230A;           Open	# (Ps) LEFT FLOOR
2329;           Open	# (Ps) LEFT-POINTING ANGLE BRACKET
2768;           Open	# (Ps) MEDIUM LEFT PARENTHESIS ORNAMENT
276A;           Open	# (Ps) MEDIUM FLATTENED LEFT PARENTHESIS ORNAMENT
276C;           Open	# (Ps) MEDIUM LEFT-POINTING ANGLE BRACKET ORNAMENT
276E;           Open	# (Ps) HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
2770;           Open	# (Ps) HEAVY LEFT-POINTING ANGLE BRACKET ORNAMENT
2772;           Open	# (Ps) LIGHT LEFT TORTOISE SHELL BRACKET ORNAMENT
2774;           Open	# (Ps) MEDIUM LEFT CURLY BRACKET ORNAMENT
27C5;           Open	# (Ps) LEFT S-SHAPED BAG DELIMITER
27E6;           Open	# (Ps) MATHEMATICAL LEFT WHITE SQUARE BRACKET
27E8;           Open	# (Ps) MATHEMATICAL LEFT ANGLE BRACKET
27EA;           Open	# (Ps) MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
27EC;           Open	# (Ps) MATHEMATICAL LEFT WHITE TORTOISE SHELL BRACKET
27EE;           Open	# (Ps) MATHEMATICAL LEFT FLATTENED PARENTHESIS
2983;           Open	# (Ps) LEFT WHITE CURLY BRACKET
2985;           Open	# (Ps) LEFT WHITE PARENTHESIS
2987;           Open	# (Ps) Z NOTATION LEFT IMAGE BRACKET
2989;           Open	# (Ps) Z NOTATION LEFT BINDING BRACKET
298B;           Open	# (Ps) LEFT SQUARE BRACKET WITH UNDERBAR
298D;           Open	# (Ps) LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298F;           Open	# (Ps) LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2991;           Open	# (Ps) LEFT ANGLE BRACKET WITH DOT
2993;           Open	# (Ps) LEFT ARC LESS-THAN BRACKET
2995;           Open	# (Ps) DOUBLE LEFT ARC GREATER-THAN BRACKET
2997;           Open	# (Ps) LEFT BLACK TORTOISE SHELL BRACKET
29D8;           Open	# (Ps) LEFT WIGGLY FENCE
29DA;           Open	# (Ps) LEFT DOUBLE WIGGLY FENCE
29FC;           Open	# (Ps) LEFT-POINTING CURVED ANGLE BRACKET
2E22;           Open	# (Ps) TOP LEFT HALF BRACKET
2E24;           Open	# (Ps) BOTTOM LEFT HALF BRACKET
2E26;           Open	# (Ps) LEFT SIDEWAYS U BRACKET
2E28;           Open	# (Ps) LEFT DOUBLE PARENTHESIS
2E55;           Open	# (Ps) LEFT SQUARE BRACKET WITH STROKE
2E57;           Open	# (Ps) LEFT SQUARE BRACKET WITH DOUBLE STROKE
2E59;           Open	# (Ps) TOP HALF LEFT PARENTHESIS
2E5B;           Open	# (Ps) BOTTOM HALF LEFT PARENTHESIS
3008;           Open	# (Ps) LEFT ANGLE BRACKET
300A;           Open	# (Ps) LEFT DOUBLE ANGLE BRACKET
300C;           Open	# (Ps) LEFT CORNER BRACKET
300E;           Open	# (Ps) LEFT WHITE CORNER BRACKET
3010;           Open	# (Ps) LEFT BLACK LENTICULAR BRACKET
3014;           Open	# (Ps) LEFT TORTOISE SHELL BRACKET
3016;           Open	# (Ps) LEFT WHITE LENTICULAR BRACKET
3018;           Open	# (Ps) LEFT WHITE TORTOISE SHELL BRACKET
301A;           Open	# (Ps) LEFT WHITE SQUARE BRACKET
FE59;           Open	# (Ps) SMALL LEFT PARENTHESIS
FE5B;           Open	# (Ps) SMALL LEFT CURLY BRACKET
FE5D;           Open	# (Ps) SMALL LEFT TORTOISE SHELL BRACKET
FF08;           Open	# (Ps) FULLWIDTH LEFT PARENTHESIS
FF3B;           Open	# (Ps) FULLWIDTH LEFT SQUARE BRACKET
FF5B;           Open	# (Ps) FULLWIDTH LEFT CURLY BRACKET
FF5F;           Open	# (Ps) FULLWIDTH LEFT WHITE PARENTHESIS
FF62;           Open	# (Ps) HALFWIDTH LEFT CORNER BRACKET
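
As an illustration only, the Link_Termination listings above can be loaded with a few lines of Python. This is a minimal sketch that assumes the line layout shown in this draft (a code point or range, a semicolon, the value, then a comment); the names LINE_RE and parse_link_termination are hypothetical and are not defined by this specification.

```python
import re

# Matches data lines such as
# "FE50..FE52;     Soft\t# (Po) SMALL COMMA..(Po) SMALL FULL STOP".
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?;\s*(\w+)")

def parse_link_termination(lines):
    """Return a dict mapping each listed code point to its property value."""
    values = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # header comments and blank lines carry no data
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        for cp in range(first, last + 1):
            values[cp] = match.group(3)
    return values

sample = [
    "#\tLink_Termination=Close",
    "FE50..FE52;     Soft\t# (Po) SMALL COMMA..(Po) SMALL FULL STOP",
    "0029;           Close\t# (Pe) RIGHT PARENTHESIS",
    "0028;           Open\t# (Ps) LEFT PARENTHESIS",
]
table = parse_link_termination(sample)
print(table[ord(")")])  # Close
print(table[0xFE51])    # Soft
```

Expanding ranges into individual code points keeps lookups trivial; an implementation that loads the full data might prefer a sorted range table instead.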


#	Link_Paired_Opener
#   draft = BidiPairedBracket + (“>” GREATER-THAN SIGN 🡆  “<” LESS-THAN SIGN)

0029;   0028	# “)” RIGHT PARENTHESIS 🡆  “(” LEFT PARENTHESIS
003E;   003C	# “>” GREATER-THAN SIGN 🡆  “<” LESS-THAN SIGN
005D;   005B	# “]” RIGHT SQUARE BRACKET 🡆  “[” LEFT SQUARE BRACKET
007D;   007B	# “}” RIGHT CURLY BRACKET 🡆  “{” LEFT CURLY BRACKET
0F3B;   0F3A	# “༻” TIBETAN MARK GUG RTAGS GYAS 🡆  “༺” TIBETAN MARK GUG RTAGS GYON
0F3D;   0F3C	# “༽” TIBETAN MARK ANG KHANG GYAS 🡆  “༼” TIBETAN MARK ANG KHANG GYON
169C;   169B	# “᚜” OGHAM REVERSED FEATHER MARK 🡆  “᚛” OGHAM FEATHER MARK
2046;   2045	# “⁆” RIGHT SQUARE BRACKET WITH QUILL 🡆  “⁅” LEFT SQUARE BRACKET WITH QUILL
207E;   207D	# “⁾” SUPERSCRIPT RIGHT PARENTHESIS 🡆  “⁽” SUPERSCRIPT LEFT PARENTHESIS
208E;   208D	# “₎” SUBSCRIPT RIGHT PARENTHESIS 🡆  “₍” SUBSCRIPT LEFT PARENTHESIS
2309;   2308	# “⌉” RIGHT CEILING 🡆  “⌈” LEFT CEILING
230B;   230A	# “⌋” RIGHT FLOOR 🡆  “⌊” LEFT FLOOR
232A;   2329	# “〉” RIGHT-POINTING ANGLE BRACKET 🡆  “〈” LEFT-POINTING ANGLE BRACKET
2769;   2768	# “❩” MEDIUM RIGHT PARENTHESIS ORNAMENT 🡆  “❨” MEDIUM LEFT PARENTHESIS ORNAMENT
276B;   276A	# “❫” MEDIUM FLATTENED RIGHT PARENTHESIS ORNAMENT 🡆  “❪” MEDIUM FLATTENED LEFT PARENTHESIS ORNAMENT
276D;   276C	# “❭” MEDIUM RIGHT-POINTING ANGLE BRACKET ORNAMENT 🡆  “❬” MEDIUM LEFT-POINTING ANGLE BRACKET ORNAMENT
276F;   276E	# “❯” HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT 🡆  “❮” HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
2771;   2770	# “❱” HEAVY RIGHT-POINTING ANGLE BRACKET ORNAMENT 🡆  “❰” HEAVY LEFT-POINTING ANGLE BRACKET ORNAMENT
2773;   2772	# “❳” LIGHT RIGHT TORTOISE SHELL BRACKET ORNAMENT 🡆  “❲” LIGHT LEFT TORTOISE SHELL BRACKET ORNAMENT
2775;   2774	# “❵” MEDIUM RIGHT CURLY BRACKET ORNAMENT 🡆  “❴” MEDIUM LEFT CURLY BRACKET ORNAMENT
27C6;   27C5	# “⟆” RIGHT S-SHAPED BAG DELIMITER 🡆  “⟅” LEFT S-SHAPED BAG DELIMITER
27E7;   27E6	# “⟧” MATHEMATICAL RIGHT WHITE SQUARE BRACKET 🡆  “⟦” MATHEMATICAL LEFT WHITE SQUARE BRACKET
27E9;   27E8	# “⟩” MATHEMATICAL RIGHT ANGLE BRACKET 🡆  “⟨” MATHEMATICAL LEFT ANGLE BRACKET
27EB;   27EA	# “⟫” MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET 🡆  “⟪” MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
27ED;   27EC	# “⟭” MATHEMATICAL RIGHT WHITE TORTOISE SHELL BRACKET 🡆  “⟬” MATHEMATICAL LEFT WHITE TORTOISE SHELL BRACKET
27EF;   27EE	# “⟯” MATHEMATICAL RIGHT FLATTENED PARENTHESIS 🡆  “⟮” MATHEMATICAL LEFT FLATTENED PARENTHESIS
2984;   2983	# “⦄” RIGHT WHITE CURLY BRACKET 🡆  “⦃” LEFT WHITE CURLY BRACKET
2986;   2985	# “⦆” RIGHT WHITE PARENTHESIS 🡆  “⦅” LEFT WHITE PARENTHESIS
2988;   2987	# “⦈” Z NOTATION RIGHT IMAGE BRACKET 🡆  “⦇” Z NOTATION LEFT IMAGE BRACKET
298A;   2989	# “⦊” Z NOTATION RIGHT BINDING BRACKET 🡆  “⦉” Z NOTATION LEFT BINDING BRACKET
298C;   298B	# “⦌” RIGHT SQUARE BRACKET WITH UNDERBAR 🡆  “⦋” LEFT SQUARE BRACKET WITH UNDERBAR
298E;   298F	# “⦎” RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER 🡆  “⦏” LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990;   298D	# “⦐” RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER 🡆  “⦍” LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
2992;   2991	# “⦒” RIGHT ANGLE BRACKET WITH DOT 🡆  “⦑” LEFT ANGLE BRACKET WITH DOT
2994;   2993	# “⦔” RIGHT ARC GREATER-THAN BRACKET 🡆  “⦓” LEFT ARC LESS-THAN BRACKET
2996;   2995	# “⦖” DOUBLE RIGHT ARC LESS-THAN BRACKET 🡆  “⦕” DOUBLE LEFT ARC GREATER-THAN BRACKET
2998;   2997	# “⦘” RIGHT BLACK TORTOISE SHELL BRACKET 🡆  “⦗” LEFT BLACK TORTOISE SHELL BRACKET
29D9;   29D8	# “⧙” RIGHT WIGGLY FENCE 🡆  “⧘” LEFT WIGGLY FENCE
29DB;   29DA	# “⧛” RIGHT DOUBLE WIGGLY FENCE 🡆  “⧚” LEFT DOUBLE WIGGLY FENCE
29FD;   29FC	# “⧽” RIGHT-POINTING CURVED ANGLE BRACKET 🡆  “⧼” LEFT-POINTING CURVED ANGLE BRACKET
2E23;   2E22	# “⸣” TOP RIGHT HALF BRACKET 🡆  “⸢” TOP LEFT HALF BRACKET
2E25;   2E24	# “⸥” BOTTOM RIGHT HALF BRACKET 🡆  “⸤” BOTTOM LEFT HALF BRACKET
2E27;   2E26	# “⸧” RIGHT SIDEWAYS U BRACKET 🡆  “⸦” LEFT SIDEWAYS U BRACKET
2E29;   2E28	# “⸩” RIGHT DOUBLE PARENTHESIS 🡆  “⸨” LEFT DOUBLE PARENTHESIS
2E56;   2E55	# “⹖” RIGHT SQUARE BRACKET WITH STROKE 🡆  “⹕” LEFT SQUARE BRACKET WITH STROKE
2E58;   2E57	# “⹘” RIGHT SQUARE BRACKET WITH DOUBLE STROKE 🡆  “⹗” LEFT SQUARE BRACKET WITH DOUBLE STROKE
2E5A;   2E59	# “⹚” TOP HALF RIGHT PARENTHESIS 🡆  “⹙” TOP HALF LEFT PARENTHESIS
2E5C;   2E5B	# “⹜” BOTTOM HALF RIGHT PARENTHESIS 🡆  “⹛” BOTTOM HALF LEFT PARENTHESIS
3009;   3008	# “〉” RIGHT ANGLE BRACKET 🡆  “〈” LEFT ANGLE BRACKET
300B;   300A	# “》” RIGHT DOUBLE ANGLE BRACKET 🡆  “《” LEFT DOUBLE ANGLE BRACKET
300D;   300C	# “」” RIGHT CORNER BRACKET 🡆  “「” LEFT CORNER BRACKET
300F;   300E	# “』” RIGHT WHITE CORNER BRACKET 🡆  “『” LEFT WHITE CORNER BRACKET
3011;   3010	# “】” RIGHT BLACK LENTICULAR BRACKET 🡆  “【” LEFT BLACK LENTICULAR BRACKET
3015;   3014	# “〕” RIGHT TORTOISE SHELL BRACKET 🡆  “〔” LEFT TORTOISE SHELL BRACKET
3017;   3016	# “〗” RIGHT WHITE LENTICULAR BRACKET 🡆  “〖” LEFT WHITE LENTICULAR BRACKET
3019;   3018	# “〙” RIGHT WHITE TORTOISE SHELL BRACKET 🡆  “〘” LEFT WHITE TORTOISE SHELL BRACKET
301B;   301A	# “〛” RIGHT WHITE SQUARE BRACKET 🡆  “〚” LEFT WHITE SQUARE BRACKET
FE5A;   FE59	# “﹚” SMALL RIGHT PARENTHESIS 🡆  “﹙” SMALL LEFT PARENTHESIS
FE5C;   FE5B	# “﹜” SMALL RIGHT CURLY BRACKET 🡆  “﹛” SMALL LEFT CURLY BRACKET
FE5E;   FE5D	# “﹞” SMALL RIGHT TORTOISE SHELL BRACKET 🡆  “﹝” SMALL LEFT TORTOISE SHELL BRACKET
FF09;   FF08	# “）” FULLWIDTH RIGHT PARENTHESIS 🡆  “（” FULLWIDTH LEFT PARENTHESIS
FF3D;   FF3B	# “］” FULLWIDTH RIGHT SQUARE BRACKET 🡆  “［” FULLWIDTH LEFT SQUARE BRACKET
FF5D;   FF5B	# “｝” FULLWIDTH RIGHT CURLY BRACKET 🡆  “｛” FULLWIDTH LEFT CURLY BRACKET
FF60;   FF5F	# “｠” FULLWIDTH RIGHT WHITE PARENTHESIS 🡆  “｟” FULLWIDTH LEFT WHITE PARENTHESIS
FF63;   FF62	# “｣” HALFWIDTH RIGHT CORNER BRACKET 🡆  “｢” HALFWIDTH LEFT CORNER BRACKET
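
The following sketch shows one way the Link_Paired_Opener mapping above might be used to find a closing bracket that has no matching opener. It is a simplified stack check written for illustration, not the Termination Algorithm of this document; the function names are hypothetical, and the two sample strings mirror the bracket cases in the draft test data.

```python
def parse_paired_openers(lines):
    """Map each closing character to its opening character."""
    pairs = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        close_hex, open_hex = (field.strip() for field in data.split(";"))
        pairs[chr(int(close_hex, 16))] = chr(int(open_hex, 16))
    return pairs

def first_unmatched_close(text, pairs):
    """Return the index of the first closer without a matching opener, or -1."""
    openers = set(pairs.values())
    stack = []
    for index, ch in enumerate(text):
        if ch in openers:
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack[-1] != pairs[ch]:
                return index
            stack.pop()
    return -1

sample = [
    "#\tLink_Paired_Opener",
    "0029;   0028\t# “)” RIGHT PARENTHESIS 🡆  “(” LEFT PARENTHESIS",
    "003E;   003C\t# “>” GREATER-THAN SIGN 🡆  “<” LESS-THAN SIGN",
]
pairs = parse_paired_openers(sample)
print(first_unmatched_close("example.com/α(β)", pairs))  # -1
print(first_unmatched_close("example.com/αβ)", pairs))   # 14
```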

For comparison to the related General_Category values, see the characters in:

7 Test Data

TBD: The plan is to have two types of test lines, something like the following.

@Linkification
# Field 0: Source
# Field 1: Expected Linkification, where:
#	⸠ is at the start, and
#	⸡ is at the end

See example.com! on…;	See ⸠example.com⸡! on…
See example.com/αβγ on…;	See ⸠example.com/αβγ⸡ on…
See example.com?αβγ on…;	See ⸠example.com?αβγ⸡ on…
See example.com#αβγ on…;	See ⸠example.com#αβγ⸡ on…
See example.com/αβγ/δεζ?θικ#λμν on…;	See ⸠example.com/αβγ/δεζ?θικ#λμν⸡ on…
See example.com/αβγ/δεζ?δ.εφ#λμν on…;	See ⸠example.com/αβγ/δεζ?δ.εφ#λμν⸡ on…
See example.com/αβγ/δεζ?δ εφ#λμν on…;	See ⸠example.com/αβγ/δεζ?δ⸡ εφ#λμν on…	# Break on hard (' ')
See example.com/αβγ/δεζ?δ. εφ#λμν on…;	See ⸠example.com/αβγ/δεζ?δ⸡. εφ#λμν on…	# Break on soft ('.') followed by hard (' ')
See example.com/α/βγ?δ/ε?ζ#λ/μ?ν#π on…;	See ⸠example.com/α/βγ?δ/ε?ζ#λ/μ?ν#π⸡ on…
See example.com/αβ) on…;	See ⸠example.com/αβ⸡) on…	# Break on unmatched bracket
See example.com/α(β) on…;	See ⸠example.com/α(β)⸡ on…	# Include matched bracket
See example.com/αβ(γ/δ)ρς?θικ#λμν on…;	See ⸠example.com/αβ(γ/δ)ρς?θικ#λμν⸡ on…	# Includes matching across interior syntax — consider changing
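
Since the test format is still a draft, the following is only a sketch of how a test harness might read one @Linkification line. It assumes fields are separated by a semicolon followed by a tab and that a trailing comment is introduced by a tab followed by "#" (a bare "#" cannot be used, because it also occurs inside URLs). The function name is hypothetical.

```python
START, END = "⸠", "⸡"  # markers for the start and end of the expected link

def parse_linkification_test(line):
    """Split a test line into (source, start, end), with offsets into source."""
    body = line.split("\t#", 1)[0]        # drop a trailing tab-separated comment
    source, expected = body.split(";\t", 1)
    start = expected.index(START)
    end = expected.index(END) - 1         # minus one for the removed START marker
    if expected.replace(START, "").replace(END, "") != source:
        raise ValueError("expected text does not match source")
    return source, start, end

line = "See example.com/αβ) on…;\tSee ⸠example.com/αβ⸡) on…\t# Break on unmatched bracket"
source, start, end = parse_linkification_test(line)
print(source[start:end])  # example.com/αβ
```

A linkifier under test would be run on the source field, and its detected span compared against the (start, end) pair.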


@Minimal-Escaping
# Field 0: Scheme and host
# Field 1: Path
# Field 2: Query
# Field 3: Fragment
# Field 4: Expected result

https://example.com;	α;	;	;	https://example.com/α	# Path only
https://example.com;	;	α;	;	https://example.com?α	# Query only
https://example.com;	;	;	α;	https://example.com#α	# Fragment only
https://example.com;	αβγ/δεζ;	θ=ικλ&μ=γξο;	πρς;	https://example.com/αβγ/δεζ?θ=ικλ&μ=γξο#πρς	# All parts
https://example.com;	α?μπ;	;	;	https://example.com/α%3Fμπ	# Escape ? in Path
https://example.com;	α#β;	γ=δ#ε;	;	https://example.com/α%23β?γ=δ%23ε	# Escape # in Path/Query
https://example.com;	αβ γ/δεζ;	θ=ικ λ&=γξο;	πρ σ;	https://example.com/αβ%20γ/δεζ?θ=ικ%20λ&=γξο#πρ%20σ	# Escape hard (' ')
https://example.com;	αβγ./δεζ.;	θ=ικ.λ&=γξο.;	πρς.;	https://example.com/αβγ./δεζ.?θ=ικ.λ&=γξο.#πρς%2E	# Escape soft ('.') unless followed by include
https://example.com;	α(β));	γ(δ));	ε(ζ));	https://example.com/α(β)%29?γ(δ)%29#ε(ζ)%29	# Escape unmatched brackets
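
The sample lines above each carry five semicolon-separated fields. As a sketch only, the code below reads one such line, shows how escapes such as %3F or %29 are derived from UTF-8 bytes, and joins the parts without any escaping. It does not implement the minimal escaping algorithm; all function names are hypothetical, and the unescaped join matches the expected result only for lines where no escaping is needed.

```python
def percent_encode(ch):
    """Percent-encode every UTF-8 byte of a single character."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))

def parse_escaping_test(line):
    """Split a test line into its five fields, dropping the trailing comment."""
    body = line.split("\t#", 1)[0]
    fields = body.split(";\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return fields

def join_unescaped(base, path, query, fragment):
    """Assemble a URL with no escaping (only correct when none is needed)."""
    url = base
    if path:
        url += "/" + path
    if query:
        url += "?" + query
    if fragment:
        url += "#" + fragment
    return url

line = ("https://example.com;\tαβγ/δεζ;\tθ=ικλ&μ=γξο;\tπρς;\t"
        "https://example.com/αβγ/δεζ?θ=ικλ&μ=γξο#πρς\t# All parts")
base, path, query, fragment, expected = parse_escaping_test(line)
print(join_unescaped(base, path, query, fragment) == expected)  # True
print(percent_encode("?"), percent_encode(")"))                 # %3F %29
```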

Review Issues

Scripts sans spaces

For scripts that don’t need spaces between words, it is a bit tricky to linkify within sentences. For example, take:

https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン is an important page.

The URL is set off from the rest of the text. But then look at it in the equivalent Japanese sentence:

https://ja.wikipedia.org/wiki/アルベルト・アインシュタインは重要なページです

[Ed Note: TBD get a better example from a native speaker.]

Simply substituting that URL for x in a phrase like “xは重要なページです” would not keep the URL separate from the text that follows, so the linkification would go too far. Some kind of separator character is needed after the URL. That can be done with Hard characters (e.g., space):

https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン は重要なページです

Or with Close characters, such as:

『https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン』は重要なページです

One could consider modifying the algorithm to provide for a termination between non-spacing scripts and spacing scripts. That wouldn’t help with the above examples, but would help with cases like:

https://en.wikipedia.org/wiki/Albert_Einsteinは重要なページです

However, that would complicate the behavior for little overall benefit.

Quotation Marks

One might consider adding quotation marks to Open/Close, but that would make the algorithm much more complicated. The problem is that the items are not uniquely Close or Open and the pairings are not 1:1 in natural languages. So these characters are categorized as Soft. Examples:

Open(s)				Close
"				"
'				'
„				“
‚				‘
‟	“	”	„	”
‛	‘	’	‚	’
‹				›
›				‹
«				»
»				«

A further complication is that some quotation marks also appear in non-paired usage: RIGHT SINGLE QUOTATION MARK and APOSTROPHE are examples, as is QUOTATION MARK when used as an alternative to HEBREW PUNCTUATION GERSHAYIM. The simplest and most predictable solution is to have them be Soft.
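
To make the ambiguity concrete, the short sketch below encodes the rows of the table above and counts the characters that occur in both the opening and the closing role. The variable names are illustrative only.

```python
# (openers, close) rows copied from the table above.
ROWS = [
    ('"', '"'),
    ("'", "'"),
    ("„", "“"),
    ("‚", "‘"),
    ("‟“”„", "”"),
    ("‛‘’‚", "’"),
    ("‹", "›"),
    ("›", "‹"),
    ("«", "»"),
    ("»", "«"),
]

openers = {ch for opens, _ in ROWS for ch in opens}
closers = {close for _, close in ROWS}
ambiguous = openers & closers  # characters seen in both roles

# Collect every opener that can pair with a given close character.
openers_for = {}
for opens, close in ROWS:
    openers_for.setdefault(close, set()).update(opens)

print(len(ambiguous))         # 10
print(len(openers_for["”"]))  # 4
```

In this table every closing character also occurs as an opener, which is why a simple paired-bracket treatment does not carry over to quotation marks.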

Angle Brackets

The < and > characters are added to Link_Paired_Opener to set off URLs, such as <https://eel.is/c++draft/vector.bool.pspc#lib:vector<bool>> and <https://wg21.link/p2348>. While many sources that formerly recommended that practice no longer do (such as the Chicago Manual of Style), others have continued the practice, such as in C++ sg16.

References

TBD

Acknowledgments

TBD

Modifications

The following summarizes modifications from the previous revision of this document.

Post working-draft L2/24-217, based on discussion during the UTC #181 meeting.

Problematic links were unlinked (they still have a highlight, but aren't active)
Added the 2nd conformance clause in Conformance
Fleshed out Minimal Escaping
Made a substantive fix to Termination Algorithm (to “If LT == Open”).
Fleshed out the review note in Security to be more specific about the contexts for the two examples mentioned (ZWJ/ZWNJ, and TAG characters), and added a note about matching brackets across syntax characters.
Added draft samples in Test Data
Various copy-edits

Modifications for previous versions are listed in those respective versions.

© 2024–2024 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.

Use of all Unicode Products, including this publication, is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.
