Add initial tokenizer for Python #310

Merged: 13 commits, Apr 2, 2022

Conversation

@certik (Contributor) commented Mar 31, 2022

One can use it via `lpython --show-tokens somefile`. Currently the tokenizer is a Fortran tokenizer, so we need to adapt it to become a Python tokenizer.
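For example (the file name and contents here are just a placeholder; `--show-tokens` is the flag added by this PR):

```console
$ echo 'x = (y + 3) * 5' > expr.py
$ lpython --show-tokens expr.py
```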

Towards #298.

@Thirumalai-Shaktivel (Collaborator) left a comment

LGTM! Let's merge this once we fix the CI

@Thirumalai-Shaktivel (Collaborator)

Hi @certik, I updated the tokenizer to recognise Comments, DocStrings, and Symbols, and also made the CI pass.
Can you please review this?

@Thirumalai-Shaktivel (Collaborator) commented Apr 1, 2022

I tested this with many Python files. I got some errors, which I fixed and will push here.
For the rest of the files, --show-tokens provides output without any errors!
Now I think we can move to the parser (Bison).

@Thirumalai-Shaktivel (Collaborator) commented Apr 1, 2022

There is a conflict between String and Docstring in the tokenizer; we need to fix that.
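To illustrate the overlap (a hypothetical example, not taken from the PR): lexically a docstring is just an ordinary string literal, and only its position distinguishes it, so separate String and Docstring rules in the tokenizer can match the same text.

```python
# Both literals below are the same triple-quoted string token lexically;
# only the position (first statement of the body) makes one a docstring.
def f():
    """I am a docstring."""
    s = """I am just a string literal."""
    return s
```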

@certik (Author) commented Apr 1, 2022

Can you get the CI passing again please?

@certik (Author) commented Apr 1, 2022

I left a fix above; after the update I think this is good to go in.

After it is merged, we should do three more steps before moving to the parser:

  • Emit indent and dedent tokens properly (i.e., handle indentation in the tokenizer)
  • Access the tokens from CPython somehow (figure out how to do that), then compare against them to verify that we produce exactly the same tokens, nothing more and nothing less (one possible approach is sketched below).
  • Remove all the Fortran stuff from the tokenizer

It doesn't have to be perfect, it won't be perfect, but I would definitely do the above three things. Then we can move to the parser and start parsing it. And then we'll iterate on the design of the tokenizer as needed.
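For the CPython comparison step, a minimal sketch of one possible approach, using the standard-library tokenize module (the source snippet is just an example):

```python
import io
import tokenize

# Dump CPython's own token stream for a snippet, to compare against
# the output of `lpython --show-tokens` on the same input.
source = b"def f(x):\n    return x + 1\n"
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```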

I posted this comment at #298 (comment).

@Thirumalai-Shaktivel (Collaborator)

Perfect, I will work on indent and dedent and send a PR.
Thanks for the review!
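For reference, a minimal sketch of the classic indentation-stack algorithm CPython uses to emit INDENT/DEDENT tokens (spaces only; tabs and continuation lines are ignored here, and this is not the eventual LPython implementation):

```python
def indent_tokens(lines):
    stack = [0]                      # indentation levels currently open
    for line in lines:
        if not line.strip():
            continue                 # blank lines emit no INDENT/DEDENT
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:        # deeper indentation opens a block
            stack.append(width)
            yield "INDENT"
        while width < stack[-1]:     # shallower indentation closes blocks
            stack.pop()
            yield "DEDENT"
    while len(stack) > 1:            # close blocks still open at EOF
        stack.pop()
        yield "DEDENT"

# Two nested blocks produce two INDENTs, then two DEDENTs at EOF:
src = ["def f():", "    if True:", "        pass"]
print(list(indent_tokens(src)))      # ['INDENT', 'INDENT', 'DEDENT', 'DEDENT']
```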

@certik certik merged commit 44b247e into lcompilers:main Apr 2, 2022
@certik certik deleted the tokenizer branch April 2, 2022 20:49
@certik (Author) commented Apr 2, 2022

Thanks! I think this looks good; now we just have to iteratively improve it. The TODO list is written up at #298 (comment).
