pikevm: fix anchored search bug

BurntSushi · BurntSushi · commit 40585afe9402 · 2023-07-12T10:13:36.000-04:00
This fixes a bug where one could ask the PikeVM to perform an anchored search, but in some cases it could return a match where the start of the match is greater than the start of the search. For example, an anchored search of the pattern '.c' on the haystack 'abc' starting at '0' would report a match at '1..3'. No other engine (other than the meta engine, which we'll address in a subsequent commit) had this bug.

The issue in the PikeVM was our simulation of the '(?s-u:.)*?' prefix for implementing unanchored searches. Namely, instead of using the NFA itself to implement the unanchored search (it has both unanchored and anchored start states), the PikeVM simulates it in code for performance reasons. This simulation was incorrect for the anchored case, because we were re-computing the epsilon closure of the start state at every step in the search. Effectively, we were simulating an unanchored search unconditionally.

The reason this bug wasn't caught is that the PikeVM only gets things half wrong. Namely, the regex '[b-z]c' does not match 'abc' when starting the search at offset '0', and that's correct. The reason is that '[b-z]' doesn't match 'a', whereas '.' in the aforementioned regex does. Since the PikeVM doesn't match there, its current list of states becomes empty, and *this* case is anchor-aware and knows not to continue the search. In other words, the PikeVM only half-implemented the unanchored search simulation. It gets it right in some cases, but not all.

We fix the bug by requiring that we only compute the epsilon closure when the search is unanchored or, if it's anchored, when the current position is at the start of the search.

We add a regression test from #1036 as well.

Partially resolves #1036
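The behavior described above can be reproduced with a toy model. The sketch below is not the code in pikevm.rs: it is a deliberately simplified simulation in which every pattern element consumes exactly one character, so there are no real epsilon transitions and "computing the epsilon closure of the start state" reduces to seeding a new thread at the current position. The names `Step`, `search`, and `guard` are hypothetical and exist only for illustration; `guard = false` models the old behavior and `guard = true` models the condition added by this commit.

```rust
/// One pattern element. Each element consumes exactly one character.
/// (Unlike a real '.', `Any` here also accepts '\n'; that does not
/// matter for the haystacks discussed in this commit.)
#[derive(Clone, Copy)]
enum Step {
    Any,
    Lit(char),
    Range(char, char),
}

impl Step {
    fn matches(self, c: char) -> bool {
        match self {
            Step::Any => true,
            Step::Lit(x) => c == x,
            Step::Range(lo, hi) => lo <= c && c <= hi,
        }
    }
}

/// Toy thread-list simulation. A thread is `(state, from)`, where
/// `state` indexes the next pattern element to match and `from` is the
/// position at which the thread was seeded. Offsets are character
/// offsets. Returns the leftmost match as `(start, end)`.
fn search(
    pattern: &[Step],
    haystack: &str,
    start: usize,
    anchored: bool,
    guard: bool,
) -> Option<(usize, usize)> {
    let hay: Vec<char> = haystack.chars().collect();
    let mut curr: Vec<(usize, usize)> = Vec::new();
    let mut next: Vec<(usize, usize)> = Vec::new();
    let mut found = None;
    for at in start..=hay.len() {
        // The anchor-aware case the message mentions: once the thread
        // list is empty, an anchored search past its start gives up.
        if curr.is_empty() && (found.is_some() || (anchored && at > start)) {
            break;
        }
        // Seeding a thread at the start state stands in for the epsilon
        // closure. Without the guard this happens at every position,
        // which is an unanchored search in disguise.
        let may_seed = !guard || !anchored || at == start;
        if found.is_none() && may_seed {
            curr.push((0, at));
        }
        for &(state, from) in &curr {
            if state == pattern.len() {
                // Leftmost-first: drop all lower-priority threads.
                found = Some((from, at));
                break;
            }
            if at < hay.len() && pattern[state].matches(hay[at]) {
                next.push((state + 1, from));
            }
        }
        std::mem::swap(&mut curr, &mut next);
        next.clear();
    }
    found
}
```

Under these assumptions, with `[Step::Any, Step::Lit('c')]` standing in for '.c', calling `search(&p, "abc", 0, true, false)` yields `Some((1, 3))`, which is the bogus match from the bug report, while `search(&p, "abc", 0, true, true)` yields `None`. With `[Step::Range('b', 'z'), Step::Lit('c')]` standing in for '[b-z]c', the anchored search yields `None` even without the guard, because the thread list empties at position 0 and the anchor-aware early exit fires. That is the "half wrong" behavior that hid the bug.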
diff --git a/regex-automata/src/nfa/thompson/pikevm.rs b/regex-automata/src/nfa/thompson/pikevm.rs
@@ -1356,7 +1356,15 @@ impl PikeVM {
             // matches their behavior. (Generally, 'allmatches' is useful for
             // overlapping searches or leftmost anchored searches to find the
             // longest possible match by ignoring match priority.)
-            if !pid.is_some() || allmatches {
+            //
+            // Additionally, when we're running an anchored search, this
+            // epsilon closure should only be computed at the beginning of the
+            // search. If we re-computed it at every position, we would be
+            // simulating an unanchored search when we were tasked to perform
+            // an anchored search.
+            if (!pid.is_some() || allmatches)
+                && (!anchored || at == input.start())
+            {
                 // Since we are adding to the 'curr' active states and since
                 // this is for the start ID, we use a slots slice that is
                 // guaranteed to have the right length but where every element
diff --git a/testdata/anchored.toml b/testdata/anchored.toml
@@ -69,3 +69,13 @@ haystack = 'abcβ'
 matches = [[0, 3]]
 anchored = true
 unicode = false
+
+# Tests that '.c' doesn't match 'abc' when performing an anchored search from
+# the beginning of the haystack. This test found two different bugs in the
+# PikeVM and the meta engine.
+[[test]]
+name = "no-match-at-start"
+regex = '.c'
+haystack = 'abc'
+matches = []
+anchored = true