Emit custom character classes like an alternation #590

Closed
rctcwyvrn wants to merge 27 commits

rctcwyvrn
Contributor

@rctcwyvrn rctcwyvrn commented Jul 26, 2022

I thought this would be relatively easy to implement, and I was curious about the performance benefits, so I put together a quick implementation.

A number of benchmarks regressed even though they don't hit any of the changed code (most of them don't even contain a custom character class). Most likely this is due to the instruction I added.

As expected, EmailRFCNoMatches is much faster now that its non-ASCII CCC has its ASCII members collected into a bitset.
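
To illustrate the idea of collecting a class's ASCII members into a bitset, here is a minimal sketch. The names `ASCIIBitset` and `collectASCII(from:)` are hypothetical and chosen for this example; the sketch is not the PR's implementation, only one way such a structure could look.

```swift
/// A 128-bit set of ASCII code points stored in two 64-bit words.
/// Hypothetical type for illustration; not the project's implementation.
struct ASCIIBitset {
  private var low: UInt64 = 0   // code points 0...63
  private var high: UInt64 = 0  // code points 64...127

  /// Adds an ASCII byte to the set. Non-ASCII values are ignored.
  mutating func insert(_ byte: UInt8) {
    guard byte < 128 else { return }
    if byte < 64 {
      low |= 1 << UInt64(byte)
    } else {
      high |= 1 << UInt64(byte - 64)
    }
  }

  /// Membership is a shift and a mask, with no per-member comparison.
  func contains(_ byte: UInt8) -> Bool {
    guard byte < 128 else { return false }
    if byte < 64 {
      return low & (1 << UInt64(byte)) != 0
    } else {
      return high & (1 << UInt64(byte - 64)) != 0
    }
  }
}

/// Splits class members into an ASCII bitset and the remaining
/// non-ASCII members, which still need a slower matching path.
func collectASCII(
  from members: [Character]
) -> (bitset: ASCIIBitset, rest: [Character]) {
  var bitset = ASCIIBitset()
  var rest: [Character] = []
  for member in members {
    if let value = member.asciiValue {
      bitset.insert(value)
    } else {
      rest.append(member)
    }
  }
  return (bitset, rest)
}

// Models a class such as [az@é]: three ASCII members and one non-ASCII member.
let (bitset, nonASCII) = collectASCII(from: ["a", "z", "@", "é"])
```

With this split, an ASCII input byte can be accepted or rejected with a single bit test, and only non-ASCII input has to fall through to the remaining members.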

The big benefit is being able to remove all the code in ConsumerInterface that matched things we already have instructions for (characters, scalars, case-insensitive matching, etc.).

Based on #547

Comparing against benchmark result file before.json
=== Regressions ======================================================================
- EmailLookaheadAll                       88.6ms	85.8ms	2.84ms		3.3%
- DiceRollsInTextAll                      68.7ms	66ms	2.66ms		4.0%
- EmailLookaheadNoMatchesAll              63.2ms	61.1ms	2.09ms		3.4%
- BasicBuiltinCharacterClassAll           16.2ms	14.9ms	1.32ms		8.9%
- EmailLookaheadList                      24.5ms	23.4ms	1.04ms		4.4%
- EmailBuiltinCharacterClassAll           26.2ms	25.1ms	1.02ms		4.1%
- NumbersAll                              11.3ms	10.3ms	986µs		9.6%
- InvertedCCC                             29.6ms	28.9ms	668µs		2.3%
- WordsAll                                26.3ms	25.9ms	411µs		1.6%
- AnchoredNotFoundWhole                   9.73ms	9.41ms	328µs		3.5%
=== Improvements =====================================================================
- EmailRFCNoMatchesAll                    62.8ms	106ms	-43.1ms		-40.7%
- symDiffCCC                              19.4ms	40.7ms	-21.4ms		-52.5%
- IntersectionCCC                         12ms	16.2ms	-4.2ms		-26.0%
- SubtractionCCC                          11.7ms	15.7ms	-3.92ms		-25.0%
- EmailRFCAll                             48.5ms	50.6ms	-2.14ms		-4.2%
- EagarQuantWithTerminalWhole             7.75ms	8.41ms	-663µs		-7.9%
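
The benchmark output has no column headers. The figures are consistent with reading the columns as new time, baseline time (from before.json), difference, and change relative to the baseline; that reading is inferred from the numbers, not stated in the PR. A small sketch of the arithmetic under that assumption, with `percentChange` as a hypothetical helper:

```swift
/// Percentage change of `new` relative to `baseline`.
/// Hypothetical helper; the column interpretation is an assumption.
func percentChange(new: Double, baseline: Double) -> Double {
  (new - baseline) / baseline * 100
}

// EmailLookaheadAll: 88.6ms against a baseline of 85.8ms (listed as 3.3%).
let regression = percentChange(new: 88.6, baseline: 85.8)

// EmailRFCNoMatchesAll: 62.8ms against a baseline of 106ms (listed as -40.7%).
let improvement = percentChange(new: 62.8, baseline: 106)
```

Because the listed times are already rounded, the recomputed values agree with the listed percentages only to within about a tenth of a percent.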

rctcwyvrn added 27 commits July 5, 2022 14:21
- matchBuiltin always fails if at endIndex
- fix switch in isStrictAscii
- static vars in payloads
- Clean up _CharacterClassModel
- Use the model for bytecodegen and consumer interface
- Merge the grapheme and scalar match builtin cases together
- Removes the main consumer interface for ccc
- Removes a lot of the consumer interface code required for ccc
- Adds an optimization for collecting the ascii parts of a ccc
- Use normal matching code in a CCC
Comment on lines +1031 to +1033
/// We allow trivia into CustomCharacterClass, which could result in a CCC that matches nothing
/// ie (?x)[ ]
var guaranteesForwardProgress: Bool {
Contributor


I believe we should reject (?x)[ ] at parse time (we should reject a character class with no semantic members). That being said, it is possible to have a custom character class that matches nothing, e.g. [a--a] (though as I understand it, that would still guarantee forward progress).
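
A minimal sketch of why a class like [a--a] can match nothing and still guarantee forward progress: the class is modeled here as a plain `Set<Character>`, and `matchClass` is a hypothetical helper, not the engine's code. A match either fails or consumes exactly one character, so it never succeeds without advancing.

```swift
/// Hypothetical model of matching a custom character class at a position.
/// Returns the index after the consumed character, or nil on failure.
func matchClass(
  _ members: Set<Character>,
  in input: String,
  at index: String.Index
) -> String.Index? {
  // Matching always fails at the end of the input.
  guard index < input.endIndex, members.contains(input[index]) else {
    return nil
  }
  return input.index(after: index)
}

let a: Set<Character> = ["a"]
let emptyClass = a.subtracting(a)   // models [a--a]: no members remain
let text = "abc"

// The empty class fails everywhere, so it never succeeds without consuming.
let noMatch = matchClass(emptyClass, in: text, at: text.startIndex)

// A non-empty class that succeeds always moves past the start index.
let oneStep = matchClass(a, in: text, at: text.startIndex)
```

In this model the forward-progress guarantee holds vacuously for the empty class, since there is no successful match that could leave the position unchanged.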

@rctcwyvrn rctcwyvrn changed the title from "Replace custom character classes with an alternation" to "Emit custom character classes like an alternation" on Jul 26, 2022
natecook1000 added a commit to natecook1000/swift-experimental-string-processing that referenced this pull request Apr 13, 2023
This is based heavily off the work in swiftlang#590, rebased onto main, with
some changes to remove even more consumer uses. Consumer functions
only have two remaining uses: non-ASCII ranges and Unicode lookups
(for things like general category, binary properties, name, etc.).

This change primarily treats custom character classes as alternations
around their contents, with set operations emitted as instructions
instead of being implemented via consumer functions.
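
The idea of treating a custom character class as an alternation over its contents can be sketched as follows. This is a simplified model that evaluates membership directly; the actual change emits matching-engine instructions, and the `ClassMember` type and `classMatches` function are hypothetical names used only for this illustration.

```swift
/// Hypothetical model of the members of a custom character class.
enum ClassMember {
  case character(Character)
  case range(ClosedRange<Character>)
  case subtraction([ClassMember], [ClassMember])
  case intersection([ClassMember], [ClassMember])
}

/// A class matches when any of its members matches, like an alternation.
/// Set operations combine the results of their operand classes.
func classMatches(_ members: [ClassMember], _ c: Character) -> Bool {
  members.contains { member in
    switch member {
    case .character(let expected):
      return c == expected
    case .range(let range):
      return range.contains(c)
    case .subtraction(let lhs, let rhs):
      return classMatches(lhs, c) && !classMatches(rhs, c)
    case .intersection(let lhs, let rhs):
      return classMatches(lhs, c) && classMatches(rhs, c)
    }
  }
}

// Models [a-z--[aeiou]]: lowercase ASCII letters except vowels.
let consonants: [ClassMember] = [
  .subtraction(
    [.range("a"..."z")],
    [.character("a"), .character("e"), .character("i"),
     .character("o"), .character("u")]
  )
]
```

Each case here corresponds to something the engine can already match on its own (a character, a range), which is what makes a separate consumer function unnecessary for those members.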
@natecook1000
Member

Superseded by #660.

natecook1000 added a commit that referenced this pull request Mar 27, 2024
This is based heavily off the work in #590, rebased onto main, with
some changes to remove even more consumer uses. Consumer functions
only have two remaining uses: non-ASCII ranges and Unicode lookups
(for things like general category, binary properties, name, etc.).

This change primarily treats custom character classes as alternations
around their contents, with set operations emitted as instructions
instead of being implemented via consumer functions.