[Unicode]   Common Locale Data Repository : Bug Tracking Home | Site Map | Search
 
Modify

CLDR Ticket #4452(accepted defect)

Opened 6 years ago

Last modified 3 years ago

Possible regex speedups

Reported by: mark Owned by: googler
Component: perf Data Locale:
Phase: dsub Review:
Weeks: 1 Data Xpath:
Xref:

Description (last modified by mark) (diff)

We make heavy use of regexes. There is a potential speedup by using possessive quantifiers.

The possessive quantifiers are in http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
See also http://www.regular-expressions.info/possessive.html

Example:

^//ldml/localeDisplayNames/variants/variant\[@type="([^"]*)"]

Since we are matching a whole attribute value, could be the following to avoid backtrack.

^//ldml/localeDisplayNames/variants/variant\[@type="([^"]*+)"]

Attachments

Change History

comment:1 Changed 6 years ago by mark

  • Description modified (diff)

comment:2 Changed 6 years ago by mark

  • Owner changed from somebody to mark
  • Priority changed from assess to medium
  • Status changed from new to assigned
  • Component changed from unknown to data
  • Milestone changed from UNSCH to 22

Mark to do in some code; others should look for these and file bugs.

comment:3 Changed 6 years ago by mark

  • Weeks set to 1

comment:4 Changed 6 years ago by mark

  • Milestone changed from 22 to 23dsub

comment:5 Changed 5 years ago by emmons

  • Milestone changed from 23dsub to 23dres

comment:6 Changed 5 years ago by mark

  • Milestone changed from 23dres to cleanup

comment:7 Changed 4 years ago by mark

  • Component changed from data to perf

comment:8 Changed 4 years ago by mark

  • Owner changed from mark to ribnitz
  • Milestone changed from cleanup to 25final

comment:9 Changed 4 years ago by mark

  • Milestone changed from 25final to 26dsub

comment:10 Changed 4 years ago by mark

For RegexLookup, we could also try looking at the constant initial substrings. For example,

abcde(x|y)  => X
abc(z*)   => Y
abcd((e|f]x)* => Z
cat(a*) => W

could be preprocessed to separate out the constant sections, and use a Trie2 to map them to the remainders.

NOTE: with java API we can test which expressions can match which initial substrings. So we can look at the initial substrings and find out which full regexes to try. So we could end up with a Trie2 with mappings like:

abcde 
  => 
   abcde(x|y)  => X
   abcd((e|f]x)* => Z
abc
  =>
    abc(z*)   => Y
cat
  =>
    cat(a*) => W

Note: we use a restricted vocabulary of regex syntax anyway, and if it helps performance, we can look at restricting it further!

comment:11 Changed 4 years ago by mark

  • Priority changed from medium to major

comment:12 Changed 4 years ago by emmons

  • Milestone changed from 26dsub to 26dvet

Moving all 26dsub to 26dvet. Please assess the need to complete tickets by 26dvet, which is 2014-06-19

comment:13 Changed 4 years ago by mark

  • Owner changed from ribnitz to googler

comment:14 Changed 4 years ago by mark

  • Milestone changed from 26dvet to 27dsub

comment:15 Changed 4 years ago by markus

  • Phase set to dsub
  • Milestone changed from 27dsub to 27

comment:16 Changed 3 years ago by emmons

  • Milestone changed from 27 to UNSCH

comment:17 Changed 3 years ago by srl

  • Status changed from assigned to accepted
View

Add a comment

Modify Ticket

Action
as accepted
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.