Description
Describe your feature request
I'm unsure if I should report this as a feature request for a currently-missing "optimization", or a performance bug due to a missed optimization. The feature request template is easier to fill out, so... Here we are.
TLDR: Matching against a long list of strings is slower than it should be. Manually (or automatically) optimizing it in ways that a computer should be able to easily perform (see the giant gross regexes in code blocks) seems to have up to a ~2x performance benefit in some cases, such as the provided benchmark.
I'm working on a crate that performs syntax highlighting of source code (something like a rust equivalent to pygments).
As you may or may not imagine, it performs a lot of regex matching. One of the most common situations in this kind of thing is "match any of the words in this medium-sized list" (usually on the order of tens of items, and probably very rarely more than 100, if ever).
That is, this is trying to match a regex which conceptually looks something like `\b(foo|bar|baz|quux|...)\b`, and it's typically used for all kinds of things: keywords, builtin functions/types, well-known stuff from a language's stdlib, special constants, ...
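For concreteness, the way one of these patterns typically gets built and used looks roughly like this (a minimal sketch with made-up example words and names, not my actual code):

```rust
use regex::Regex;

// Escape each word and join with `|` inside word boundaries. This is the
// "naive" construction discussed below.
fn keyword_regex(words: &[&str]) -> Regex {
    let escaped: Vec<String> = words.iter().map(|w| regex::escape(w)).collect();
    Regex::new(&format!(r"\b(?:{})\b", escaped.join("|"))).unwrap()
}

fn main() {
    let re = keyword_regex(&["fn", "let", "match", "impl", "return"]);
    let hits: Vec<&str> = re
        .find_iter("fn main() { let x = 1; }")
        .map(|m| m.as_str())
        .collect();
    assert_eq!(hits, ["fn", "let"]);
}
```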
From what I gather, this sort of regex has historically been a big performance pain point for other regex-based highlighting engines. Most of them seem to have some comments or guidance around it (often "avoid these"). Interestingly, pygments has a function that compiles an optimized regex from a list of matching words. The code for this is here: https://github.com/pygments/pygments/blob/faf69c028f0efa7798a938163e8432d838acfda2/pygments/regexopt.py
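As far as I can tell, the core of that trick is: build a trie from the word list, then emit nested alternations so common prefixes are shared. Here's a rough Rust sketch of the same idea; the names are made up for this issue rather than anything from pygments or the regex crate:

```rust
use std::collections::BTreeMap;

#[derive(Default)]
struct Node {
    children: BTreeMap<char, Node>,
    terminal: bool, // a word from the list ends at this node
}

fn insert(root: &mut Node, word: &str) {
    let mut node = root;
    for ch in word.chars() {
        node = node.children.entry(ch).or_default();
    }
    node.terminal = true;
}

// Regex for everything *below* `node` (the character leading here is
// emitted by the caller).
fn to_regex(node: &Node) -> String {
    let branches: Vec<String> = node
        .children
        .iter()
        .map(|(ch, child)| format!("{}{}", regex::escape(&ch.to_string()), to_regex(child)))
        .collect();
    match (branches.len(), node.terminal) {
        (0, _) => String::new(),                            // leaf, nothing more to match
        (1, false) => branches.into_iter().next().unwrap(), // single path, no group needed
        (_, false) => format!("(?:{})", branches.join("|")),
        // A word also ends here, so the whole tail is optional; the longer
        // branches get tried before the empty alternative, and the trailing
        // \b sorts out the rest.
        (_, true) => format!("(?:{})?", branches.join("|")),
    }
}

fn word_list_regex(words: &[&str]) -> String {
    let mut root = Node::default();
    for w in words {
        insert(&mut root, w);
    }
    format!(r"\b{}\b", to_regex(&root))
}
```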
I had hoped that this would be unnecessary for the `regex` crate. In a fit of hubris I told myself things like "it'll be fine...", that "some sort of aho-corasick-ey thing" would save the day, and that "the python code seems like it's an ad-hoc finite automata simplification" which I don't have to worry about because the regex crate is so good at that stuff. (A combination of making things up to sound smart, and stating things as facts because I hoped they were true.)
Anyway, I wasn't going to completely handwave it, so I did a smoke test with a benchmark, and (you can probably guess the results already, since I'm here and I wrote them in the TLDR above)... it turns out it's in fact not entirely fine.
The concrete numbers on my machine were (... you can tell from the benchmark names that this didn't quite turn out how I had expected):
```
test naive                   ... bench:   2,177,060 ns/iter (+/- 248,300)
test quote_optimized_unquote ... bench:   1,148,483 ns/iter (+/- 350,485)
```
Specifically, I get a ~2x speedup from using this (generated using the python `pygments.regex_opt` algorithm linked above):
```
\b(Self|a(?:bstract|s)|b(?:ecome|o(?:ol|x)|reak)|c(?:har|on(?:st|tinue)|rate)|do|e(?:lse|num|xtern)|f(?:32|64|alse|inal|n|or)|i(?:1(?:28|6)|32|64|mpl|size|[8fn])|l(?:et|oop)|m(?:a(?:cro|tch)|o(?:d|ve)|ut)|override|p(?:riv|ub)|re(?:f|turn)|s(?:elf|t(?:atic|r(?:(?:uct)?))|uper)|t(?:r(?:ait|ue|y)|ype(?:(?:of)?))|u(?:1(?:28|6)|32|64|8|ns(?:afe|ized)|s(?:(?:(?:iz)?)e))|virtual|wh(?:(?:er|il)e)|yield)\b
```
Compared to this (generated by putting some clothes around the result of `words.join('|')`):
```
\b(as|break|const|continue|crate|else|enum|extern|false|fn|for|if|impl|in|let|loop|match|mod|move|mut|pub|ref|return|self|Self|static|struct|super|trait|true|type|unsafe|use|where|while|abstract|become|box|do|final|macro|override|priv|typeof|unsized|virtual|yield|try|i8|i16|i32|i64|i128|isize|u8|u16|u32|u64|u128|usize|bool|char|str|f32|f64)\b
```
(These match something vaguely shaped like the list of Rust keywords and builtin types)
The runnable (after a small amount of legwork) benchmark is available here: https://gist.github.com/thomcc/eeb77d34d8924444b5778d2e451caac6
(Note that the actual operation it's benching isn't really what the highlighting loop looks like, although I guess it's not entirely different either)
Anyway, now that all that rambling and background is out of the way, you can probably guess what I'm asking for, but just to be concrete:
I would prefer not to need to perform this kind of optimization on the input myself. It seems like perhaps the regex crate could perform something like it itself, or maybe something else could be done.
P.S. Psst, I uh, also wouldn't mind some pointers for a workaround, if you've got any (unless it's likely to be an easy fix, anyway). Or more generally, advice for how to match against a long list of words efficiently:
- Should I do nothing because the numbers I saw are atypical and in practice this optimization won't generally help?
- Should I slightly rephrase the regexes so that another optimization can kick in? (I have some flexibility here, and can plausibly avoid the leading `\b` with some effort, for example.)
- Should I just port that python optimization (roughly the trie sketch above) to my code, and run it on the input word-lists before tossing them to `regex`?
- Should I instead use the `aho_corasick` (and/or `fst`) crate directly, and then figure out word boundaries after-the-fact using ~~crying~~ ~~guessing~~ my wit and ingenuity? (A rough sketch of that is below.)
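For that last option, the kind of thing I have in mind looks roughly like this sketch (assuming a recent, 1.x-style aho-corasick API; the word-boundary check is something I'd hand-roll, not something the crate provides):

```rust
use aho_corasick::{AhoCorasick, MatchKind};

// Hand-rolled "word byte" check; this helper is mine, not part of aho-corasick.
fn is_word_byte(b: u8) -> bool {
    b == b'_' || b.is_ascii_alphanumeric()
}

// Find keyword hits, then throw away any that aren't on word boundaries.
fn keyword_hits(ac: &AhoCorasick, haystack: &str) -> Vec<(usize, usize)> {
    let bytes = haystack.as_bytes();
    ac.find_iter(haystack)
        .filter(|m| {
            let before_ok = m.start() == 0 || !is_word_byte(bytes[m.start() - 1]);
            let after_ok = m.end() == bytes.len() || !is_word_byte(bytes[m.end()]);
            before_ok && after_ok
        })
        .map(|m| (m.start(), m.end()))
        .collect()
}

fn main() {
    let keywords = ["fn", "let", "match", "return"];
    // LeftmostLongest so overlapping keywords prefer the longer match.
    let ac = AhoCorasick::builder()
        .match_kind(MatchKind::LeftmostLongest)
        .build(&keywords)
        .unwrap();
    for (start, end) in keyword_hits(&ac, "fn main() { let x = 1; }") {
        println!("keyword at {}..{}", start, end);
    }
}
```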