Commit 40585af

pikevm: fix anchored search bug
This fixes a bug where one could ask the PikeVM to perform an anchored search, but in some cases it could return a match whose start is greater than the start of the search. For example, an anchored search of the pattern '.c' on the haystack 'abc' starting at '0' would report a match at '1..3'. No other engine (other than the meta engine, which we'll address in a subsequent commit) had this bug.

The issue in the PikeVM was our simulation of the '(?s-u:.)*?' prefix for implementing unanchored searches. Namely, instead of using the NFA itself to implement the unanchored search (it has both unanchored and anchored start states), the PikeVM simulates it in code for performance reasons. This simulation was incorrect for the anchored case, because we were re-computing the epsilon closure of the start state at every step of the search. Effectively, we were simulating an unanchored search unconditionally.

The reason this bug wasn't caught is that the PikeVM only gets things half wrong. Namely, the regex '[b-z]c' does not match 'abc' when starting the search at offset '0', and that's correct. The reason is that '[b-z]' doesn't match 'a', whereas '.' in the aforementioned regex does. Since the PikeVM doesn't match there, its current list of states becomes empty, and *this* case is anchor-aware and knows not to continue the search. In other words, the PikeVM only half-implemented the unanchored search simulation. It gets it right in some cases, but not all.

We fix the bug by computing the epsilon closure only when the search is unanchored or, if it's anchored, only when the current position is the start of the search.

We add a regression test from #1036 as well.

Partially resolves #1036
1 parent bbb285b commit 40585af

2 files changed: 19 additions, 1 deletion

regex-automata/src/nfa/thompson/pikevm.rs

Lines changed: 9 additions & 1 deletion
@@ -1356,7 +1356,15 @@ impl PikeVM {
             // matches their behavior. (Generally, 'allmatches' is useful for
             // overlapping searches or leftmost anchored searches to find the
             // longest possible match by ignoring match priority.)
-            if !pid.is_some() || allmatches {
+            //
+            // Additionally, when we're running an anchored search, this
+            // epsilon closure should only be computed at the beginning of the
+            // search. If we re-computed it at every position, we would be
+            // simulating an unanchored search when we were tasked to perform
+            // an anchored search.
+            if (!pid.is_some() || allmatches)
+                && (!anchored || at == input.start())
+            {
                 // Since we are adding to the 'curr' active states and since
                 // this is for the start ID, we use a slots slice that is
                 // guaranteed to have the right length but where every element

testdata/anchored.toml

Lines changed: 10 additions & 0 deletions
@@ -69,3 +69,13 @@ haystack = 'abcβ'
 matches = [[0, 3]]
 anchored = true
 unicode = false
+
+# Tests that '.c' doesn't match 'abc' when performing an anchored search from
+# the beginning of the haystack. This test found two different bugs in the
+# PikeVM and the meta engine.
+[[test]]
+name = "no-match-at-start"
+regex = '.c'
+haystack = 'abc'
+matches = []
+anchored = true
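
The effect of the new guard can be reproduced in a toy simulation that uses only the standard library. This is a minimal sketch, not the PikeVM's code: `Class`, `search`, and the `fixed` flag are hypothetical names introduced here for illustration, the "pattern" is a plain sequence of byte classes with no epsilon transitions, and match priority is reduced to "first thread in the list wins". Only the seeding condition `!anchored || at == start` mirrors the condition added in the diff above.

```rust
/// One element of a toy, epsilon-free pattern: a byte class.
/// (Hypothetical type for illustration; not part of regex-automata.)
#[derive(Clone, Copy)]
pub enum Class {
    /// Matches any byte (stands in for '.').
    Any,
    /// Matches an inclusive byte range, e.g. Range(b'b', b'z') for '[b-z]'.
    Range(u8, u8),
}

impl Class {
    fn matches(self, b: u8) -> bool {
        match self {
            Class::Any => true,
            Class::Range(lo, hi) => lo <= b && b <= hi,
        }
    }
}

/// Simulates the pattern over `haystack`, beginning at offset `start`.
/// Returns the (start, end) offsets of the match found, if any.
///
/// `fixed == false` reproduces the old behavior (seed a new thread at
/// every position); `fixed == true` applies the guard from the commit.
pub fn search(
    pattern: &[Class],
    haystack: &[u8],
    start: usize,
    anchored: bool,
    fixed: bool,
) -> Option<(usize, usize)> {
    // Active threads as (index of next class to match, match start offset),
    // kept in priority order.
    let mut curr: Vec<(usize, usize)> = Vec::new();
    let mut matched = None;

    for at in start..=haystack.len() {
        // The early exit that was already anchor-aware: with no live
        // threads, an anchored search past `start` cannot succeed.
        if curr.is_empty() && (matched.is_some() || (anchored && at > start)) {
            break;
        }

        // Seed a thread at the pattern's start. Before the fix this ran at
        // every position, which silently turned an anchored search into an
        // unanchored one.
        let may_seed = !fixed || !anchored || at == start;
        if matched.is_none() && may_seed {
            curr.push((0, at));
        }

        // Step every thread over the byte at `at`.
        let mut next = Vec::new();
        for &(pc, match_start) in &curr {
            if pc == pattern.len() {
                // This thread consumed the whole pattern; lower-priority
                // threads are dropped.
                matched = Some((match_start, at));
                break;
            }
            if at < haystack.len() && pattern[pc].matches(haystack[at]) {
                next.push((pc + 1, match_start));
            }
        }
        curr = next;
    }
    matched
}
```

With `'.c'` written as `[Class::Any, Class::Range(b'c', b'c')]`, an anchored call `search(&pat, b"abc", 0, true, false)` returns `Some((1, 3))`, the incorrect match described in the commit message, while the same call with `fixed = true` returns `None`. Replacing `Class::Any` with `Class::Range(b'b', b'z')` returns `None` in both modes, which is why the old code appeared correct for '[b-z]c'.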
