Add initial tokenizer for Python #310

Merged: 13 commits, Apr 2, 2022

Conversation

@certik (Contributor) commented Mar 31, 2022

One can use it via `lpython --show-tokens somefile`. Currently the tokenizer is a Fortran tokenizer, so we need to adapt it to become a Python tokenizer.
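For example (the file name and contents here are just a placeholder; `--show-tokens` is the flag added by this PR):

```console
$ echo 'x = (y + 3) * 5' > expr.py
$ lpython --show-tokens expr.py
```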

Towards #298.

@Thirumalai-Shaktivel (Collaborator) left a comment

LGTM! Let's merge this once we fix the CI

@Thirumalai-Shaktivel (Collaborator)

Hi @certik, I updated the tokenizer to recognise Comments, DocStrings, and Symbols, and also made the CI pass.
Can you please review this?

@Thirumalai-Shaktivel (Collaborator) commented Apr 1, 2022

I tested this with many Python files. I got some errors, which I fixed and will push here.
For the rest of the files, --show-tokens provides output without any errors!
Now I think we can move to the parser (Bison).

@Thirumalai-Shaktivel (Collaborator) commented Apr 1, 2022

There is a conflict between String and Docstring in the tokenizer; we need to fix that.
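To illustrate the overlap (a hypothetical example, not taken from the PR): lexically a docstring is just an ordinary string literal, and only its position distinguishes it, so separate String and Docstring rules in the tokenizer can match the same text.

```python
# Both literals below are the same triple-quoted string token lexically;
# only the position (first statement of the body) makes one a docstring.
def f():
    """I am a docstring."""
    s = """I am just a string literal."""
    return s
```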

@certik (Author) commented Apr 1, 2022

Can you get the CI passing again please?

@certik (Author) commented Apr 1, 2022

I left a fix above; after the update I think this is good to go in.

After it is merged, we should do three more steps before moving to the parser:

  • Emit indent and dedent tokens properly (i.e., handle indentation in the tokenizer)
  • Access the tokens from CPython somehow (figure out how to do that), then compare against them to verify that we produce exactly the same tokens, nothing more and nothing less (one possible approach is sketched below).
  • Remove all the Fortran stuff from the tokenizer

It doesn't have to be perfect, it won't be perfect, but I would definitely do the above three things. Then we can move to the parser and start parsing it. And then we'll iterate on the design of the tokenizer as needed.
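For the CPython comparison step, a minimal sketch of one possible approach, using the standard-library tokenize module (the source snippet is just an example):

```python
import io
import tokenize

# Dump CPython's own token stream for a snippet, to compare against
# the output of `lpython --show-tokens` on the same input.
source = b"def f(x):\n    return x + 1\n"
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```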

I posted this comment at #298 (comment).

@Thirumalai-Shaktivel (Collaborator)

Perfect, I will work on indent and dedent and send a PR.
Thanks for the review!
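For reference, a minimal sketch of the classic indentation-stack algorithm CPython uses to emit INDENT/DEDENT tokens (spaces only; tabs and continuation lines are ignored here, and this is not the eventual LPython implementation):

```python
def indent_tokens(lines):
    stack = [0]                      # indentation levels currently open
    for line in lines:
        if not line.strip():
            continue                 # blank lines emit no INDENT/DEDENT
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:        # deeper indentation opens a block
            stack.append(width)
            yield "INDENT"
        while width < stack[-1]:     # shallower indentation closes blocks
            stack.pop()
            yield "DEDENT"
    while len(stack) > 1:            # close blocks still open at EOF
        stack.pop()
        yield "DEDENT"

# Two nested blocks produce two INDENTs, then two DEDENTs at EOF:
src = ["def f():", "    if True:", "        pass"]
print(list(indent_tokens(src)))      # ['INDENT', 'INDENT', 'DEDENT', 'DEDENT']
```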

@certik certik merged commit 44b247e into lcompilers:main Apr 2, 2022
@certik certik deleted the tokenizer branch April 2, 2022 20:49
@certik (Author) commented Apr 2, 2022

Thanks! I think this looks good; now we just have to iteratively improve it. The TODO list is written up at #298 (comment).
