Separate multi-word keywords from the rest #363

Merged
nene merged 8 commits into master from keyword-phrases on Aug 1, 2022

Conversation

nene (Collaborator) commented Aug 1, 2022

Added a RESERVED_PHRASE token type. It serves as a home for multi-word keyword sequences, separating them from the list of ordinary single-word keywords.

Cleaned up all keyword lists, so the *.keyword.ts files now finally list only single-word keywords.
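
As a rough illustration of the split (apart from the RESERVED_PHRASE and RESERVED_KEYWORD token names, everything below is made up for this sketch and is not the actual code in the repository):

```ts
// Illustrative sketch only -- the list contents and variable names are invented.
// Single-word keywords stay in the ordinary keyword list...
const reservedKeywords = ['SELECT', 'INSERT', 'UPDATE', 'ORDER', 'GROUP'];

// ...while multi-word keyword sequences get their own list and are emitted
// by the tokenizer as RESERVED_PHRASE tokens instead of RESERVED_KEYWORD.
const reservedPhrases = ['INSERT INTO', 'ORDER BY', 'GROUP BY', 'LEFT OUTER JOIN'];

type TokenType = 'RESERVED_KEYWORD' | 'RESERVED_PHRASE';

interface Token {
  type: TokenType;
  text: string;
}
```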

nene requested a review from inferrinizzard on August 1, 2022 at 10:30
inferrinizzard (Collaborator) commented Aug 1, 2022

Doesn't the current regex ordering already prioritise by length?
These phrases should already be matching before their single-word counterparts.
It seems a little weird to split them based only on the fact that they contain spaces, rather than on semantic meaning.
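
For context, a minimal sketch of such length-first matching (a simplified stand-in, not the project's actual tokenizer code): sorting the candidates by descending length before building the regex means a phrase like ORDER BY is tried before the bare ORDER.

```ts
// Sketch of length-first matching: longer candidates are placed earlier
// in the alternation, so multi-word phrases win over their prefixes.
const candidates = ['ORDER BY', 'ORDER', 'GROUP BY', 'GROUP'];

const pattern = new RegExp(
  '^(?:' +
    candidates
      .sort((a, b) => b.length - a.length) // longest first
      .map(word => word.replace(/ /g, '\\s+')) // allow any whitespace between words
      .join('|') +
    ')\\b',
  'i'
);

console.log('order by x'.match(pattern)?.[0]); // "order by", not just "order"
```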

nene (Collaborator, Author) commented Aug 1, 2022

You're correct, we already prioritize by length, so in the end it'll work out exactly the same. You're also right that there's nothing semantic grouping these keyword sequences together, other than the fact that they're multi-word.

I had two goals for doing this reorganization:

  • First, our current keyword lists contain some "keywords" which aren't really single words. For me, this feels far from intuitive. Even the token name is RESERVED_KEYWORD, which suggests a single word, but currently it might span more than one word.
  • Second, I'd like to evolve our code more towards a parser-based approach. That is, individual keywords should be detected by the lexer, but sequences of these keywords should really be handled at the parser level. For example, the current keyword-sequence tokens break down when one adds a comment between the words (e.g. INSERT /* comment */ INTO). Handling such cases inside the lexer is really not feasible, so it makes sense to me to start separating the data needed by the lexer (keywords) from data that's really the territory of a parser (sequences of keywords).
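
To make the second point concrete, here is a rough sketch of what phrase detection could look like at the parser level, where comment tokens between the keywords are simply skipped. The names and token shapes are hypothetical and not taken from this PR:

```ts
// Hypothetical parser-side sketch: recognize a keyword sequence even when
// comments sit between the keywords, e.g. INSERT /* comment */ INTO.
interface Token {
  type: 'RESERVED_KEYWORD' | 'COMMENT' | 'OTHER';
  text: string;
}

// Returns true when the tokens starting at `start` spell out `phrase`,
// treating comment tokens as transparent.
function matchesPhrase(tokens: Token[], start: number, phrase: string[]): boolean {
  let i = start;
  for (const word of phrase) {
    while (tokens[i]?.type === 'COMMENT') i++; // skip over comments
    const tok = tokens[i];
    if (!tok || tok.type !== 'RESERVED_KEYWORD' || tok.text.toUpperCase() !== word) {
      return false;
    }
    i++;
  }
  return true;
}

const tokens: Token[] = [
  { type: 'RESERVED_KEYWORD', text: 'INSERT' },
  { type: 'COMMENT', text: '/* comment */' },
  { type: 'RESERVED_KEYWORD', text: 'INTO' },
];

console.log(matchesPhrase(tokens, 0, ['INSERT', 'INTO'])); // true
```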

nene merged commit cbaddc0 into master on Aug 1, 2022
nene deleted the keyword-phrases branch on August 1, 2022 at 19:10
nene mentioned this pull request on Aug 8, 2022