Description
Hello and thank you for your work!
I work on different projects that use the regex crate intensively and I recently started doing some comparisons with other regex engines around different aspects, mainly syntax, performance and memory usage.
That comparison drew my attention to some syntax that could be considered "invalid" but is currently accepted by this crate.
I will leave it to you to decide whether these examples should be accepted or not:
- `^*\.google\.com$`: repetition of the start anchor: accepted by the golang regex engine, refused by hyperscan, refused by python (same question applies to other kinds of repeat modifiers like `?`, `+`)
- `a**\.google\.com$`: multiple consecutive identical repeat modifiers (you can specify as many as you like): refused by golang, refused by hyperscan, refused by python (same question applies to other repeat modifiers like `?` and `+`, except that `??` has a special meaning).
- `a*+?*+?\.google\.com$`: multiple consecutive different repeat modifiers: refused by golang, refused by hyperscan, refused by python. According to the documentation, maybe only `*?`, `+?` and `??` should be accepted?
Versions tested:
- rust-lang/regex 1.4.6 (latest version at the time of creation of this issue)
- hyperscan 5.4.0
- Python 3.9.4 (using `re.compile`)
- golang: https://regoio.herokuapp.com/
I think PCRE2 will accept 2 repeat modifiers like `**` but not more; however, I haven't verified those examples against PCRE2 yet.
If those examples are considered bugs, would they be fixable? There might be backward compatibility concerns here.
If they are not considered bugs, maybe it could be explained/detailed in the doc?
In any case, I would want to refuse such syntax in the systems I operate.
In your opinion, what would be the best approach here? Would you recommend using the regex-syntax crate to parse the pattern and analyze the AST to detect such constructs?
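To make the kind of check I have in mind concrete, here is a rough, std-only sketch of a pre-filter that flags a repetition operator applied to `^`/`$` or stacked directly on another repetition operator, while still allowing the lazy forms `*?`, `+?` and `??`. The function name `has_suspicious_repetition` and the `Prev` enum are hypothetical and not part of the regex crate. It is a plain character scan, not an AST walk, so it deliberately ignores counted repetitions like `{m,n}`, assertions written as escapes such as `\b`, verbose mode, and groups that wrap only an anchor; an AST-based check would presumably be more robust.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// What the scanner saw just before the current character.
#[derive(Clone, Copy)]
enum Prev {
    Other,     // start of pattern or any ordinary atom
    GroupOpen, // `(`, so a following `?` starts group flags, not a repetition
    Anchor,    // `^` or `$`
    Repeat,    // `*`, `+`, or `?` used as a repetition operator
    Lazy,      // the trailing `?` of `*?`, `+?` or `??`
}

/// After a `[`, skip an optional negation `^` and an optional literal `]`.
fn skip_class_prefix(chars: &mut Peekable<Chars<'_>>) {
    if chars.peek() == Some(&'^') {
        chars.next();
    }
    if chars.peek() == Some(&']') {
        chars.next();
    }
}

/// Hypothetical helper: returns true if `pattern` repeats an anchor or
/// stacks repetition operators beyond a single lazy `?`.
pub fn has_suspicious_repetition(pattern: &str) -> bool {
    let mut chars = pattern.chars().peekable();
    let mut prev = Prev::Other;
    let mut class_depth = 0usize;

    while let Some(c) = chars.next() {
        // Inside a bracketed class, `*`, `+`, `?`, `^` and `$` are literals.
        if class_depth > 0 {
            match c {
                '\\' => {
                    chars.next(); // skip the escaped character
                }
                '[' => {
                    class_depth += 1; // nested class
                    skip_class_prefix(&mut chars);
                }
                ']' => {
                    class_depth -= 1;
                    if class_depth == 0 {
                        prev = Prev::Other; // the whole class is one atom
                    }
                }
                _ => {}
            }
            continue;
        }

        prev = match c {
            '\\' => {
                chars.next(); // an escaped character counts as an atom
                Prev::Other
            }
            '[' => {
                class_depth = 1;
                skip_class_prefix(&mut chars);
                Prev::Other
            }
            '(' => Prev::GroupOpen,
            '^' | '$' => Prev::Anchor,
            '*' | '+' => match prev {
                Prev::Anchor | Prev::Repeat | Prev::Lazy => return true,
                _ => Prev::Repeat,
            },
            '?' => match prev {
                Prev::GroupOpen => Prev::Other, // `(?...` group syntax
                Prev::Repeat => Prev::Lazy,     // `*?`, `+?`, `??`
                Prev::Anchor | Prev::Lazy => return true,
                Prev::Other => Prev::Repeat,
            },
            _ => Prev::Other,
        };
    }
    false
}
```

With this sketch, the three patterns listed above are flagged, while something like `^a*?b+?c??\.google\.com$` passes through and is left for the regex crate itself to compile.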
Thank you for your time :)