-
Notifications
You must be signed in to change notification settings - Fork 50
Add options support to the compiler #112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 4 commits
29faff1
2086a06
db077d5
bf0a8f5
29d8f0d
ffc1f5f
6e9a5db
f199910
57a550c
4fd82dd
26c83fd
a7a3917
265baac
1345bfc
7b390b7
25345cc
550068f
e50bad9
15c470d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,7 +12,7 @@ | |
extension AST { | ||
/// An option written in source that changes matching semantics. | ||
public struct MatchingOption: Hashable { | ||
public enum Kind { | ||
public enum Kind: Int { | ||
// PCRE options | ||
case caseInsensitive // i | ||
case allowDuplicateGroupNames // J | ||
|
@@ -36,6 +36,11 @@ extension AST { | |
// be unset, only flipped between) | ||
case textSegmentGraphemeMode // y{g} | ||
case textSegmentWordMode // y{w} | ||
|
||
// Swift semantic matching level | ||
case graphemeClusterSemantics // X | ||
case unicodeScalarSemantics // u | ||
case byteSemantics // b | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are these existing options, or are you defining Swift-specific syntax? We'll want to be able to lower this AST node (entirely) into a custom OptionSet-like model type using your merge semantics below. But, we do need a semantics for each option (even if that's just @hamishknight what do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah these are Swift-specific, they don't conflict with the matching options from other engines tho AFAIK There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would like to keep these well-delineated in that case, and be very clear that this is something we're formally proposing as an addition to existing regex syntax. Should we prefix the names with We should also make a note or sub-section in #63 for these kinds of syntactic additions. They're otherwise easy to lose track of. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added to #63. |
||
} | ||
public var kind: Kind | ||
public var location: SourceLocation | ||
|
@@ -53,6 +58,15 @@ extension AST { | |
return false | ||
} | ||
} | ||
|
||
public var isSemanticMatchingLevel: Bool { | ||
switch kind { | ||
case .graphemeClusterSemantics, .unicodeScalarSemantics, .byteSemantics: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do other engines have ways of switching semantic models between byte and scalar? How should we lower those? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The only existing engine I've found that has this semantic switching is Rust's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In Oniguruma, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We'll probably want a DSL equivalent for everything we support here. Have you thought about that at all, something like a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, we'll need some way to configure DSL-defined patterns, for these, case insensitivity, etc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What are your thoughts there? This seems relevant to whether we should be using an enum in the AST as the same type/representation for our model. |
||
return true | ||
default: | ||
return false | ||
} | ||
} | ||
} | ||
|
||
/// A sequence of matching options written in source. | ||
|
@@ -79,6 +93,74 @@ extension AST { | |
self.minusLoc = minusLoc | ||
self.removing = removing | ||
} | ||
|
||
public func options(merging optionSet: MatchingOptionSet = []) -> MatchingOptionSet { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wouldn't this be a method on |
||
var result = optionSet | ||
for opt in adding { | ||
hamishknight marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if opt.isSemanticMatchingLevel { | ||
result.remove(.semanticMatchingLevels) | ||
} | ||
if opt.isTextSegmentMode { | ||
result.remove(.textSegmentOptions) | ||
} | ||
|
||
result.insert(.init(opt.kind)) | ||
} | ||
for opt in removing { | ||
result.remove(.init(opt.kind)) | ||
} | ||
return result | ||
} | ||
} | ||
|
||
/// A set of matching options. | ||
public struct MatchingOptionSet: OptionSet { | ||
natecook1000 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
public var rawValue: UInt32 | ||
|
||
public init(rawValue: UInt32) { | ||
self.rawValue = rawValue | ||
} | ||
|
||
public init(_ kind: AST.MatchingOption.Kind) { | ||
self.rawValue = 1 << kind.rawValue | ||
} | ||
|
||
// PCRE options | ||
public static var caseInsensitive: Self { .init(.caseInsensitive) } | ||
public static var allowDuplicateGroupNames: Self { .init(.allowDuplicateGroupNames) } | ||
public static var multiline: Self { .init(.multiline) } | ||
public static var noAutoCapture: Self { .init(.noAutoCapture) } | ||
public static var singleLine: Self { .init(.singleLine) } | ||
public static var reluctantByDefault: Self { .init(.reluctantByDefault) } | ||
public static var extended: Self { .init(.extended) } | ||
public static var extraExtended: Self { .init(.extraExtended) } | ||
milseman marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
// ICU options | ||
public static var unicodeWordBoundaries: Self { .init(.unicodeWordBoundaries) } | ||
|
||
// Oniguruma options | ||
public static var asciiOnlyDigit: Self { .init(.asciiOnlyDigit) } | ||
public static var asciiOnlyPOSIXProps: Self { .init(.asciiOnlyPOSIXProps) } | ||
public static var asciiOnlySpace: Self { .init(.asciiOnlySpace) } | ||
public static var asciiOnlyWord: Self { .init(.asciiOnlyWord) } | ||
|
||
// Oniguruma text segment options (these are mutually exclusive and cannot | ||
// be unset, only flipped between) | ||
public static var textSegmentGraphemeMode: Self { .init(.textSegmentGraphemeMode) } | ||
public static var textSegmentWordMode: Self { .init(.textSegmentWordMode) } | ||
|
||
public static var textSegmentOptions: Self { | ||
[.textSegmentGraphemeMode, .textSegmentWordMode] | ||
} | ||
|
||
// Swift semantic matching level | ||
public static var graphemeClusterSemantics: Self { .init(.graphemeClusterSemantics) } | ||
public static var unicodeScalarSemantics: Self { .init(.unicodeScalarSemantics) } | ||
public static var byteSemantics: Self { .init(.byteSemantics) } | ||
|
||
public static var semanticMatchingLevels: Self { | ||
[.graphemeClusterSemantics, .unicodeScalarSemantics, .byteSemantics] | ||
} | ||
} | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -550,6 +550,11 @@ extension Source { | |
try src.expect("}") | ||
return opt | ||
|
||
// Swift semantic level options | ||
case "X": return advanceAndReturn(.graphemeClusterSemantics) | ||
case "u": return advanceAndReturn(.unicodeScalarSemantics) | ||
case "b": return advanceAndReturn(.byteSemantics) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this novel syntax we're talking about adding, or does this exist? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These would be Swift-specific flags. The |
||
default: | ||
return nil | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -338,8 +338,6 @@ extension AST.Atom { | |
switch kind { | ||
case let .escaped(b): return b.characterClass | ||
|
||
case .any: return .any | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's happening to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems like a reasonable line for someone to add. Could you explicitly return nil with a comment explaining why we don't want to produce a character class? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (outside the scope of this PR) Or for that matter, should we remove or redesign
We got it off the AST a while back, but we still have this as being We definitely will want some type that people can use as namespaces to look up built-in sets, and a type for registering their own via closures (either We will separately want a model-layer type to assist in compilation. That might be the same type if it happens to align, but I definitely don't want to try to burden an API type with that need. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added a note about where |
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removing this means that the |
||
case .property: | ||
// TODO: Would our model type for character classes include | ||
// this? Or does grapheme-semantic mode complicate that? | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,18 +18,22 @@ struct RegexProgram { | |
|
||
class Compiler { | ||
let ast: AST | ||
let matchLevel: CharacterClass.MatchLevel | ||
let options: REOptions | ||
private var optionStack: MatchingOptionSetStack | ||
private var builder = RegexProgram.Program.Builder() | ||
|
||
private var currentOptions: AST.MatchingOptionSet { | ||
guard let top = optionStack.top else { | ||
fatalError("Unbalanced matching options removal") | ||
natecook1000 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
return top | ||
} | ||
|
||
init( | ||
ast: AST, | ||
matchLevel: CharacterClass.MatchLevel = .graphemeCluster, | ||
options: REOptions = [] | ||
options: AST.MatchingOptionSet = [.init(.graphemeClusterSemantics)] | ||
milseman marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) { | ||
self.ast = ast | ||
self.matchLevel = matchLevel | ||
self.options = options | ||
self.optionStack = MatchingOptionSetStack(options) | ||
} | ||
|
||
__consuming func emit() throws -> RegexProgram { | ||
|
@@ -42,11 +46,6 @@ class Compiler { | |
func emit(_ node: AST) throws { | ||
|
||
switch node { | ||
// Any: . | ||
// consume 1 | ||
case .atom(let a) where a.kind == .any && matchLevel == .graphemeCluster: | ||
builder.buildAdvance(1) | ||
|
||
milseman marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// Single characters we just match | ||
case .atom(let a) where a.singleCharacter != nil : | ||
builder.buildMatch(a.singleCharacter!) | ||
|
@@ -91,6 +90,9 @@ class Compiler { | |
break | ||
|
||
case .group(let g): | ||
optionStack.push(currentOptions) | ||
defer { optionStack.pop() } | ||
|
||
if let lookaround = g.lookaroundKind { | ||
try emitLookaround(lookaround, g.child) | ||
return | ||
|
@@ -107,6 +109,18 @@ class Compiler { | |
try emit(g.child) | ||
builder.buildEndCapture(cap) | ||
|
||
case .changeMatchingOptions(let optionSequence, isIsolated: let isIsolated): | ||
let updated = optionSequence.options(merging: currentOptions) | ||
natecook1000 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
if isIsolated { | ||
optionStack.replaceTop(updated) | ||
try emit(g.child) | ||
} else { | ||
optionStack.push(updated) | ||
try emit(g.child) | ||
optionStack.pop() | ||
} | ||
milseman marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
default: | ||
// FIXME: Other kinds... | ||
try emit(g.child) | ||
|
@@ -118,8 +132,8 @@ class Compiler { | |
// For now, we model sets and atoms as consumers. | ||
// This lets us rapidly expand support, and we can better | ||
// design the actual instruction set with real examples | ||
case _ where try node.generateConsumer(matchLevel) != nil: | ||
try builder.buildConsume(by: node.generateConsumer(matchLevel)!) | ||
case _ where try node.generateConsumer(currentOptions) != nil: | ||
try builder.buildConsume(by: node.generateConsumer(currentOptions)!) | ||
|
||
case .quote(let q): | ||
// We stick quoted content into read-only constant strings | ||
|
@@ -474,6 +488,47 @@ class Compiler { | |
} | ||
} | ||
|
||
/// A stack of `MatchingOptionSet`s. | ||
fileprivate struct MatchingOptionSetStack { | ||
internal var stack: [AST.MatchingOptionSet] | ||
natecook1000 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
init(_ initial: AST.MatchingOptionSet) { | ||
self.stack = [initial] | ||
} | ||
|
||
natecook1000 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
var top: AST.MatchingOptionSet? { stack.last } | ||
|
||
mutating func push(_ set: AST.MatchingOptionSet) { | ||
stack.append(set) | ||
} | ||
|
||
mutating func replaceTop(_ set: AST.MatchingOptionSet) { | ||
stack.removeLast() | ||
stack.append(set) | ||
} | ||
|
||
@discardableResult | ||
mutating func pop() -> AST.MatchingOptionSet { | ||
stack.removeLast() | ||
} | ||
milseman marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
// Deprecated matchLevel-based initializer | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Who calls this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Only a legacy test for semantic level matching that isn't what we're planning anyway. I've removed the test. |
||
extension Compiler { | ||
@available(*, deprecated) | ||
convenience init( | ||
ast: AST, | ||
matchLevel: CharacterClass.MatchLevel, | ||
options: REOptions = [] | ||
) { | ||
if matchLevel == .graphemeCluster { | ||
self.init(ast: ast, options: .init(.graphemeClusterSemantics)) | ||
} else { | ||
self.init(ast: ast, options: .init(.unicodeScalarSemantics)) | ||
} | ||
} | ||
} | ||
|
||
public func _compileRegex( | ||
_ regex: String, _ syntax: SyntaxOptions = .traditional | ||
) throws -> Executor { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,12 +28,8 @@ func unsupported( | |
file: StaticString = #file, | ||
line: UInt = #line | ||
) -> Unsupported { | ||
// TODO: how do we not have a public init for this? | ||
let fStr = file.withUTF8Buffer { | ||
String(decoding: $0, as: UTF8.self) | ||
} | ||
return Unsupported( | ||
message: s, file: fStr, line: Int(line)) | ||
message: s, file: String(describing: file), line: Int(line)) | ||
} | ||
|
||
extension AST { | ||
|
@@ -42,8 +38,7 @@ extension AST { | |
/// A consumer is a Swift closure that matches against | ||
/// the front of an input range | ||
func generateConsumer( | ||
// TODO: Better option modeling | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: AST.MatchingOptionSet | ||
) throws -> Program<String>.ConsumeFunction? { | ||
switch self { | ||
case .atom(let a): | ||
|
@@ -77,10 +72,13 @@ extension AST.Atom { | |
} | ||
|
||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: AST.MatchingOptionSet | ||
) throws -> Program<String>.ConsumeFunction? { | ||
// TODO: Wean ourselves off of this type... | ||
if let cc = self.characterClass?.withMatchLevel(opts) { | ||
let matchLevel: CharacterClass.MatchLevel = opts.contains(.unicodeScalarSemantics) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We probably want to be calling richer semantic API here, so that changes to exactly what combinations of options mean can be abstracted and in one place. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, I think we need to get rid of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using a set-membership check of a syntactic enum feels a bit like a smell. What if there are multiple ways to spell it, or it came in via API? Should this instead be a |
||
? .unicodeScalar | ||
: .graphemeCluster | ||
if let cc = self.characterClass?.withMatchLevel(matchLevel) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, let's get that logic above into the richer semantic type.... |
||
return { input, bounds in | ||
// FIXME: should we worry about out of bounds? | ||
cc.matches(in: input, at: bounds.lowerBound) | ||
|
@@ -109,9 +107,20 @@ extension AST.Atom { | |
// TODO: alias? casing? | ||
$0.name == name || $0.nameAlias == name | ||
} | ||
|
||
case .any: | ||
return { input, bounds in | ||
let curIndex = bounds.lowerBound | ||
if !opts.contains(.singleLine) && input[curIndex].isNewline { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again, |
||
return nil | ||
} | ||
return opts.contains(.graphemeClusterSemantics) | ||
? input.index(after: curIndex) | ||
: input.unicodeScalars.index(after: curIndex) | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is definitely something we're going to want to represent in the matching engine directly. |
||
|
||
case .escaped, .keyboardControl, .keyboardMeta, .keyboardMetaControl, | ||
.any, .startOfLine, .endOfLine, | ||
.startOfLine, .endOfLine, | ||
.backreference, .subpattern, .condition: | ||
// FIXME: implement | ||
return nil | ||
|
@@ -121,7 +130,7 @@ extension AST.Atom { | |
|
||
extension AST.CustomCharacterClass.Member { | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: AST.MatchingOptionSet | ||
) throws -> Program<String>.ConsumeFunction { | ||
switch self { | ||
case .custom(let ccc): | ||
|
@@ -212,7 +221,7 @@ extension AST.CustomCharacterClass.Member { | |
|
||
extension AST.CustomCharacterClass { | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: AST.MatchingOptionSet | ||
) throws -> Program<String>.ConsumeFunction { | ||
// NOTE: Easy way to implement, obviously not performant | ||
let consumers = try members.map { | ||
|
@@ -265,7 +274,7 @@ private func consumeScalar( | |
|
||
extension AST.Atom.CharacterProperty { | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: AST.MatchingOptionSet | ||
) throws -> Program<String>.ConsumeFunction { | ||
// Handle inversion for us, albeit not efficiently | ||
func invert( | ||
|
@@ -335,7 +344,7 @@ extension AST.Atom.CharacterProperty { | |
extension Unicode.BinaryProperty { | ||
// FIXME: Semantic level, vet for precise defs | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: AST.MatchingOptionSet | ||
) throws -> Program<String>.ConsumeFunction { | ||
switch self { | ||
|
||
|
@@ -499,7 +508,7 @@ extension Unicode.BinaryProperty { | |
extension Unicode.POSIXProperty { | ||
// FIXME: Semantic level, vet for precise defs | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: AST.MatchingOptionSet | ||
) -> Program<String>.ConsumeFunction { | ||
// FIXME: semantic levels, modes, etc | ||
switch self { | ||
|
@@ -545,7 +554,7 @@ extension Unicode.POSIXProperty { | |
extension Unicode.ExtendedGeneralCategory { | ||
// FIXME: Semantic level | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: AST.MatchingOptionSet | ||
) throws -> Program<String>.ConsumeFunction { | ||
switch self { | ||
case .letter: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this is
Int
-backed, we can use the raw value as the bit offset forMatchingOptionSet.rawValue
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we'll want a model-layer enum or something that represents the compiler's semantic understanding of options rather than the parser's syntactic representation. I ask because this can be (but isn't necessarily!) a code smell that we're still conflating the syntax layer with the semantic layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is every option here something that would make sense in our DSL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several of these don't really make sense for the DSL:
extra
,extraExtended
: about literal parsing rather than actual matchingallowDuplicateGroupNames
,noAutoCapture
: the DSL will have a different way of capturingreluctantByDefault
: irrelevant (and/or terrible) for the DSL APIThe rest affect how a pattern matches, but I don't know if we'll want to do them as "options" per se or as different function calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there options that make sense for the DSL but are not present in a literal? How divergent is our syntactic representation from our semantic model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the
: Int
, make a model-layer enum with our option model, function to lower from AST to that model.