Skip to content

Commit 70c7f57

Browse files
committed
meta: fix anchored search bugs
It turns out that all three of the "reverse" optimizations in the meta regex engine did not support anchored searches correctly. This was intended, and in particular, none of these optimizations are active when the regex is anchored at the beginning. However, a caller can still request an anchored search even when the regex itself isn't anchored. In this case, the general best approach is to just do a standard forward regex search. Namely, the reverse suffix and reverse inner optimizations are generally throughput optimizations, and anchored searches tend to be more heavily dominated by latency. Now it is plausible that we will want to do some optimizations in the anchored case. For example, we might want to confirm that a required literal is in the haystack before running a standard forward regex search. But I think that's future work and will probably benefit from being a distinct strategy. It's also somewhat tricky to do because while it will make performance in the "no match" case much better, it will likely regress performance in the "always match" case. Anyway, we add more regression tests covering all of these cases and fix the bug. We fix it by just checking whether the caller requested an anchored search, and if so, fall back to the core engine. Fixes #1036
1 parent 40585af commit 70c7f57

File tree

2 files changed

+95
-0
lines changed

2 files changed

+95
-0
lines changed

regex-automata/src/meta/strategy.rs

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -845,6 +845,14 @@ impl ReverseAnchored {
845845
);
846846
return Err(core);
847847
}
848+
// Note that the caller can still request an anchored search even when
849+
// the regex isn't anchored at the start. We detect that case in the
850+
// search routines below and just fallback to the core engine. This
851+
// is fine because both searches are anchored. It's just a matter of
852+
// picking one. Falling back to the core engine is a little simpler,
853+
// since if we used the reverse anchored approach, we'd have to add an
854+
// extra check to ensure the match reported starts at the place where
855+
// the caller requested the search to start.
848856
if core.info.is_always_anchored_start() {
849857
debug!(
850858
"skipping reverse anchored optimization because \
@@ -930,6 +938,9 @@ impl Strategy for ReverseAnchored {
930938

931939
#[cfg_attr(feature = "perf-inline", inline(always))]
932940
fn search(&self, cache: &mut Cache, input: &Input<'_>) -> Option<Match> {
941+
if input.get_anchored().is_anchored() {
942+
return self.core.search(cache, input);
943+
}
933944
match self.try_search_half_anchored_rev(cache, input) {
934945
Err(_err) => {
935946
trace!("fast reverse anchored search failed: {}", _err);
@@ -948,6 +959,9 @@ impl Strategy for ReverseAnchored {
948959
cache: &mut Cache,
949960
input: &Input<'_>,
950961
) -> Option<HalfMatch> {
962+
if input.get_anchored().is_anchored() {
963+
return self.core.search_half(cache, input);
964+
}
951965
match self.try_search_half_anchored_rev(cache, input) {
952966
Err(_err) => {
953967
trace!("fast reverse anchored search failed: {}", _err);
@@ -973,6 +987,9 @@ impl Strategy for ReverseAnchored {
973987
input: &Input<'_>,
974988
slots: &mut [Option<NonMaxUsize>],
975989
) -> Option<PatternID> {
990+
if input.get_anchored().is_anchored() {
991+
return self.core.search_slots(cache, input, slots);
992+
}
976993
match self.try_search_half_anchored_rev(cache, input) {
977994
Err(_err) => {
978995
trace!("fast reverse anchored search failed: {}", _err);
@@ -1034,6 +1051,13 @@ impl ReverseSuffix {
10341051
// requires a reverse scan after a literal match to confirm or reject
10351052
// the match. (Although, in the case of confirmation, it then needs to
10361053
// do another forward scan to find the end position.)
1054+
//
1055+
// Note that the caller can still request an anchored search even
1056+
// when the regex isn't anchored. We detect that case in the search
1057+
// routines below and just fallback to the core engine. Currently this
1058+
// optimization assumes all searches are unanchored, so if we do want
1059+
// to enable this optimization for anchored searches, it will need a
1060+
// little work to support it.
10371061
if core.info.is_always_anchored_start() {
10381062
debug!(
10391063
"skipping reverse suffix optimization because \
@@ -1211,6 +1235,9 @@ impl Strategy for ReverseSuffix {
12111235

12121236
#[cfg_attr(feature = "perf-inline", inline(always))]
12131237
fn search(&self, cache: &mut Cache, input: &Input<'_>) -> Option<Match> {
1238+
if input.get_anchored().is_anchored() {
1239+
return self.core.search(cache, input);
1240+
}
12141241
match self.try_search_half_start(cache, input) {
12151242
Err(RetryError::Quadratic(_err)) => {
12161243
trace!("reverse suffix optimization failed: {}", _err);
@@ -1255,6 +1282,9 @@ impl Strategy for ReverseSuffix {
12551282
cache: &mut Cache,
12561283
input: &Input<'_>,
12571284
) -> Option<HalfMatch> {
1285+
if input.get_anchored().is_anchored() {
1286+
return self.core.search_half(cache, input);
1287+
}
12581288
match self.try_search_half_start(cache, input) {
12591289
Err(RetryError::Quadratic(_err)) => {
12601290
trace!("reverse suffix half optimization failed: {}", _err);
@@ -1309,6 +1339,9 @@ impl Strategy for ReverseSuffix {
13091339
input: &Input<'_>,
13101340
slots: &mut [Option<NonMaxUsize>],
13111341
) -> Option<PatternID> {
1342+
if input.get_anchored().is_anchored() {
1343+
return self.core.search_slots(cache, input, slots);
1344+
}
13121345
if !self.core.is_capture_search_needed(slots.len()) {
13131346
trace!("asked for slots unnecessarily, trying fast path");
13141347
let m = self.search(cache, input)?;
@@ -1396,6 +1429,13 @@ impl ReverseInner {
13961429
// or when the literal scan matches. If it matches, then confirming the
13971430
// match requires a reverse scan followed by a forward scan to confirm
13981431
// or reject, which is a fair bit of work.
1432+
//
1433+
// Note that the caller can still request an anchored search even
1434+
// when the regex isn't anchored. We detect that case in the search
1435+
// routines below and just fallback to the core engine. Currently this
1436+
// optimization assumes all searches are unanchored, so if we do want
1437+
// to enable this optimization for anchored searches, it will need a
1438+
// little work to support it.
13991439
if core.info.is_always_anchored_start() {
14001440
debug!(
14011441
"skipping reverse inner optimization because \
@@ -1635,6 +1675,9 @@ impl Strategy for ReverseInner {
16351675

16361676
#[cfg_attr(feature = "perf-inline", inline(always))]
16371677
fn search(&self, cache: &mut Cache, input: &Input<'_>) -> Option<Match> {
1678+
if input.get_anchored().is_anchored() {
1679+
return self.core.search(cache, input);
1680+
}
16381681
match self.try_search_full(cache, input) {
16391682
Err(RetryError::Quadratic(_err)) => {
16401683
trace!("reverse inner optimization failed: {}", _err);
@@ -1654,6 +1697,9 @@ impl Strategy for ReverseInner {
16541697
cache: &mut Cache,
16551698
input: &Input<'_>,
16561699
) -> Option<HalfMatch> {
1700+
if input.get_anchored().is_anchored() {
1701+
return self.core.search_half(cache, input);
1702+
}
16571703
match self.try_search_full(cache, input) {
16581704
Err(RetryError::Quadratic(_err)) => {
16591705
trace!("reverse inner half optimization failed: {}", _err);
@@ -1675,6 +1721,9 @@ impl Strategy for ReverseInner {
16751721
input: &Input<'_>,
16761722
slots: &mut [Option<NonMaxUsize>],
16771723
) -> Option<PatternID> {
1724+
if input.get_anchored().is_anchored() {
1725+
return self.core.search_slots(cache, input, slots);
1726+
}
16781727
if !self.core.is_capture_search_needed(slots.len()) {
16791728
trace!("asked for slots unnecessarily, trying fast path");
16801729
let m = self.search(cache, input)?;

testdata/anchored.toml

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -79,3 +79,49 @@ regex = '.c'
7979
haystack = 'abc'
8080
matches = []
8181
anchored = true
82+
83+
# Like above, but at a non-zero start offset.
84+
[[test]]
85+
name = "no-match-at-start-bounds"
86+
regex = '.c'
87+
haystack = 'aabc'
88+
bounds = [1, 4]
89+
matches = []
90+
anchored = true
91+
92+
# This is like no-match-at-start, but hits the "reverse inner" optimization
93+
# inside the meta engine. (no-match-at-start hits the "reverse suffix"
94+
# optimization.)
95+
[[test]]
96+
name = "no-match-at-start-reverse-inner"
97+
regex = '.c[a-z]'
98+
haystack = 'abcz'
99+
matches = []
100+
anchored = true
101+
102+
# Like above, but at a non-zero start offset.
103+
[[test]]
104+
name = "no-match-at-start-reverse-inner-bounds"
105+
regex = '.c[a-z]'
106+
haystack = 'aabcz'
107+
bounds = [1, 5]
108+
matches = []
109+
anchored = true
110+
111+
# Same as no-match-at-start, but applies to the meta engine's "reverse
112+
# anchored" optimization.
113+
[[test]]
114+
name = "no-match-at-start-reverse-anchored"
115+
regex = '.c[a-z]$'
116+
haystack = 'abcz'
117+
matches = []
118+
anchored = true
119+
120+
# Like above, but at a non-zero start offset.
121+
[[test]]
122+
name = "no-match-at-start-reverse-anchored-bounds"
123+
regex = '.c[a-z]$'
124+
haystack = 'aabcz'
125+
bounds = [1, 5]
126+
matches = []
127+
anchored = true

0 commit comments

Comments
 (0)