[Unicode]   Common Locale Data Repository : Bug Tracking Home | Site Map | Search
 
Modify

CLDR Ticket #6976(accepted task)

Opened 3 years ago

Last modified 16 months ago

Use precompiled, common regex [branch only]

Reported by: mark Owned by: googler
Component: perf Data Locale:
Phase: dsub Review:
Weeks: Data Xpath:
Xref:

Description

We don't make the most efficient use of regex in our code.

  1. Split

Instead of

z = x.split(Y); for some constant string Y

we should have

public static final Pattern Y_PAT = Pattern.compile(Y);

...

z = Y_PAT.split(x);

For example, instead of ...split("
s+") in lots of files, we should have in one place:

public static final Pattern REGEX_WHITE_SPACE = Pattern.compile("
s+");

and then replace the string.split() call sites.

  1. Replace

Similarly, instead of

z = x.replaceAll("a", "b"); or .replaceFirst

We should have in one place:

public static final Pattern A_PAT = Pattern.compile("a");

and then fix the call site to be:

z = A_PAT.matcher(x).replaceAll("b");

These can be done by searching for .split, .replaceAll, and .replaceFirst, and putting the common patterns in, say, CLDRUtilities.java.

Attachments

Change History

comment:1 Changed 3 years ago by emmons

  • Status changed from new to assigned
  • Component changed from unknown to tools
  • Priority changed from assess to minor
  • Milestone changed from UNSCH to 25final
  • Owner changed from anybody to ribnitz
  • Type changed from unknown to task

comment:2 Changed 3 years ago by mark

  • Milestone changed from 25final to 26dsub

comment:3 Changed 3 years ago by mark

  • Priority changed from minor to major

comment:4 Changed 3 years ago by mark

  • Component changed from tools to perf

comment:5 Changed 3 years ago by ribnitz

  • Status changed from assigned to reviewing
  • Review set to mark
  • Common patterns moved to Patterns.java; this class also contains static methods for splitting on a Pattern; the pattern replaced are striclty equal to the original Patterns used
  • Code was instrumented to use the Patterns class
  • For the more common uses, Guava classes were introduced for doing the splitting, as they outperform the resp. Java class.
  • In some cases, for loops were rewritten to use the extended form that uses Iterables, as an Iterable does not need copying the data, while an Array does.

comment:6 Changed 3 years ago by mark

You are using some inefficient methods on Splitter.

  1. The static methods create a new object each time. We should instead have static final Splitter objects. That is, you should never be using Splitter.on(...)
  2. The methods that take a pattern have to use regex, whereas many times we split on chars. In the cases where we also trim on spaces (eg "
    s+;
    s+", then the trimResults() can be used.
  3. There is already a splitToList, so the call sites that really want a list could use that.
  4. After all this, we may not need some of the Patterns at all, and you can drop the ones we don't actually use somewhere in our code.

comment:7 Changed 3 years ago by mark

  • Status changed from reviewing to accepted

comment:8 Changed 3 years ago by emmons

  • Milestone changed from 26dsub to 26dvet

Moving all 26dsub to 26dvet. Please assess the need to complete tickets by 26dvet, which is 2014-06-19

comment:9 Changed 3 years ago by ribnitz

  • Status changed from accepted to reviewing

comment:10 Changed 3 years ago by mark

  • Status changed from reviewing to accepted

Generally good, just needs a bit of cleanup.

http://unicode.org/cldr/trac/browser/branches/ribnitz_CommonRegexp_6976/tools/java/org/unicode/cldr/util/Patterns.java

private static Iterable<String> splitToIterable(Splitter sp,String s) {

Inline this. Private function of no value; just another function call.

General.

There are a number of places where the results of your function are fed into an Arrays.asList. Would be cleaner and faster to have a method that uses the splitter and calls .splitToList() instead of creating an iterator then an array then a list.

http://unicode.org/cldr/trac/changeset?old=9953&old_path=trunk%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Fdraft%2FGeneratePickerData.java&new_path=branches%2Fribnitz_CommonRegexp_6976%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Fdraft%2FGeneratePickerData.java&new=10026

1722 File f=new File(fileName);
1723 if (f.canRead()) {
1724
FileOpeningCounter.getInstance().add(fileName);
1725 }

Don't understand this. Looks like you make f, but never use it.

General issue in many files. You leave a trail of alternative versions. Makes the file hard to read.

1778 String[] components = Patterns.splitOnSemicolon(line);
1779 String[] components = Patterns.SEMICOLON.split(line);
1780
String[] components = line.split(";");
1781 String[] codepoints = components[0].split(" ");
1782
String[] codepoints =Patterns.SPACE_CHARACTER.split(components[0]);
1783 String[] codepoints =Patterns.splitOnSpaceCharacter(components[0]);

Instead, move the *original* line to the javadoc for the method.

=>

String[] components = Patterns.splitOnSemicolon(line);

String[] codepoints =Patterns.splitOnSpaceCharacter(components[0]);

And in Patterns, make sure that each has a javadoc with the equivalent regex

http://unicode.org/cldr/trac/browser/branches/ribnitz_CommonRegexp_6976/tools/java/org/unicode/cldr/util/Patterns.java

/

  • Equivalent of s.split(";");

*/
public static String[] splitOnSemicolon(String s) {

...

Some of the replaces will be slower, because you are creating Matchers and calling functions, where there is a fast String method for replacing single chars. Why are you doing this?

OLD
String dir = specialsDir.replace('
', '/');

YOURS (slower)
String dir = replaceBackslash(specialsDir);
...
private String replaceBackslash(String dirStr) {

return BACKSLASH.matcher(dirStr).replaceAll("/");

}

http://unicode.org/cldr/trac/changeset?old=9957&old_path=trunk%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Ftool%2FPluralRulesFactory.java&new_path=branches%2Fribnitz_CommonRegexp_6976%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Ftool%2FPluralRulesFactory.java&new=10025

Looks like you missed
private final static Pattern COMMA_WITH_SPACING=Pattern.compile("
s*,
s*");


The following look unrelated. What ticket is it supposed to be for?
If they are filed under the wrong ticket, you can file a ticket to ask Steven to fix.

http://unicode.org/cldr/trac/changeset?old=9953&old_path=trunk%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Ftest%2FCheckDisplayCollisions.java&new_path=branches%2Fribnitz_CommonRegexp_6976%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Ftest%2FCheckDisplayCollisions.java&new=10021

http://unicode.org/cldr/trac/changeset?old=9953&old_path=trunk%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Futil%2FSimpleFactory.java&new_path=branches%2Fribnitz_CommonRegexp_6976%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Futil%2FSimpleFactory.java&new=10026

http://unicode.org/cldr/trac/changeset?old=9953&old_path=trunk%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Ftest%2FCheckWidths.java&new_path=branches%2Fribnitz_CommonRegexp_6976%2Ftools%2Fjava%2Forg%2Funicode%2Fcldr%2Ftest%2FCheckWidths.java&new=10021

comment:11 Changed 3 years ago by ribnitz

See
http://unicode.org/cldr/trac/changeset/10420
and
http://unicode.org/cldr/trac/changeset/10421

Issues addesssed:

  • Removed trail
  • Added comments
  • Inlined splitToIterable(); added a split...toList() for the 2-3 cases where constructs such as Arrays.asList(Patterns.splitOnWhitespace("a b c d")) can be simplified
  • Removed trimResultArray(), as it is no longer necessary.
  • Added splitOnCommaWithSpacing(), including the resp. splitter
  • Adapted a few dependent classes (as in code cleanup, not as in functional change)

Minor improvement:
Now that we have PatternCache, it is possible to construct the splitter using the regex, which can be obtained from PatternCache.

comment:12 Changed 3 years ago by mark

  • Owner changed from ribnitz to googler

comment:13 Changed 3 years ago by mark

  • Milestone changed from 26dvet to 27dsub

comment:14 Changed 3 years ago by markus

  • Phase set to dsub
  • Milestone changed from 27dsub to 27

comment:15 Changed 2 years ago by emmons

  • Milestone changed from 27 to UNSCH

comment:16 Changed 16 months ago by mark

  • Review mark deleted
  • Summary changed from Use precompiled, common regex. to Use precompiled, common regex [branch only]
View

Add a comment

Modify Ticket

Action
as accepted
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.