Separate multi-word keywords from the rest #363

Merged
nene merged 8 commits into master from keyword-phrases on Aug 1, 2022

Conversation

nene (Collaborator) commented Aug 1, 2022

Added a RESERVED_PHRASE token type. It serves as a home for multi-word keyword sequences, separating them from the list of ordinary single-word keywords.

Cleaned up all keyword lists, so the *.keyword.ts files now finally list only single-word keywords.
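
As a rough illustration of the split (apart from the RESERVED_PHRASE and RESERVED_KEYWORD token names, everything below is made up for this sketch and is not the actual code in the repository):

```ts
// Illustrative sketch only -- the list contents and variable names are invented.
// Single-word keywords stay in the ordinary keyword list...
const reservedKeywords = ['SELECT', 'INSERT', 'UPDATE', 'ORDER', 'GROUP'];

// ...while multi-word keyword sequences get their own list and are emitted
// by the tokenizer as RESERVED_PHRASE tokens instead of RESERVED_KEYWORD.
const reservedPhrases = ['INSERT INTO', 'ORDER BY', 'GROUP BY', 'LEFT OUTER JOIN'];

type TokenType = 'RESERVED_KEYWORD' | 'RESERVED_PHRASE';

interface Token {
  type: TokenType;
  text: string;
}
```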

nene requested a review from inferrinizzard on August 1, 2022 at 10:30
inferrinizzard (Collaborator) commented Aug 1, 2022

Doesn't the current regex ordering already prioritise by length?
These phrases should already be matching before their single-word counterparts.
It seems a little weird to split them based only on the fact that they contain spaces, rather than on semantic meaning.
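
For context, a minimal sketch of such length-first matching (a simplified stand-in, not the project's actual tokenizer code): sorting the candidates by descending length before building the regex means a phrase like ORDER BY is tried before the bare ORDER.

```ts
// Sketch of length-first matching: longer candidates are placed earlier
// in the alternation, so multi-word phrases win over their prefixes.
const candidates = ['ORDER BY', 'ORDER', 'GROUP BY', 'GROUP'];

const pattern = new RegExp(
  '^(?:' +
    candidates
      .sort((a, b) => b.length - a.length) // longest first
      .map(word => word.replace(/ /g, '\\s+')) // allow any whitespace between words
      .join('|') +
    ')\\b',
  'i'
);

console.log('order by x'.match(pattern)?.[0]); // "order by", not just "order"
```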

nene (Collaborator, Author) commented Aug 1, 2022

You're correct, we already prioritize by length, so in the end it'll work out exactly the same. You're also right that there's nothing semantic grouping these keyword sequences together, other than the fact that they're multi-word.

I had two goals for doing this reorganization:

  • First, our current keyword lists contain some "keywords" which aren't really single words. For me, this feels far from intuitive. Even the token name is RESERVED_KEYWORD, which suggests a single word, but currently it might span more than one word.
  • Second, I'd like to evolve our code more towards a parser-based approach. That is, individual keywords should be detected by the lexer, but sequences of these keywords should really be handled at the parser level. For example, the current keyword-sequence tokens break down when one adds a comment between the words (e.g. INSERT /* comment */ INTO). Handling such cases inside the lexer is really not feasible, so it makes sense to me to start separating the data needed by the lexer (keywords) from data that's really the territory of a parser (sequences of keywords).
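
To make the second point concrete, here is a rough sketch of what phrase detection could look like at the parser level, where comment tokens between the keywords are simply skipped. The names and token shapes are hypothetical and not taken from this PR:

```ts
// Hypothetical parser-side sketch: recognize a keyword sequence even when
// comments sit between the keywords, e.g. INSERT /* comment */ INTO.
interface Token {
  type: 'RESERVED_KEYWORD' | 'COMMENT' | 'OTHER';
  text: string;
}

// Returns true when the tokens starting at `start` spell out `phrase`,
// treating comment tokens as transparent.
function matchesPhrase(tokens: Token[], start: number, phrase: string[]): boolean {
  let i = start;
  for (const word of phrase) {
    while (tokens[i]?.type === 'COMMENT') i++; // skip over comments
    const tok = tokens[i];
    if (!tok || tok.type !== 'RESERVED_KEYWORD' || tok.text.toUpperCase() !== word) {
      return false;
    }
    i++;
  }
  return true;
}

const tokens: Token[] = [
  { type: 'RESERVED_KEYWORD', text: 'INSERT' },
  { type: 'COMMENT', text: '/* comment */' },
  { type: 'RESERVED_KEYWORD', text: 'INTO' },
];

console.log(matchesPhrase(tokens, 0, ['INSERT', 'INTO'])); // true
```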

nene merged commit cbaddc0 into master on Aug 1, 2022
nene deleted the keyword-phrases branch on August 1, 2022 at 19:10
nene mentioned this pull request on Aug 8, 2022