Skip to content

tests : improve UGM tokenizer test coverage #13773

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 25, 2025

Conversation

CISC
Copy link
Collaborator

@CISC CISC commented May 25, 2025

Adds nomic-bert-moe vocab for improved test coverage of UGM tokenizer.

@github-actions github-actions bot added the testing Everything test related label May 25, 2025
@CISC CISC requested a review from ggerganov May 25, 2025 11:39
Copy link
Member

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid adding more vocab files because they increase the size of the repository significantly. Ideally, the tests should download these files from HF and we can avoid source-controlling them here.

We can merge this one for now.

@CISC CISC merged commit aa50ba4 into master May 25, 2025
46 checks passed
@CISC CISC deleted the cisc/test-coverage-ugm-tokenizer branch May 25, 2025 14:22
@CISC
Copy link
Collaborator Author

CISC commented May 25, 2025

In that case it would make sense to have a vocab repo in ggml-org on HF?

@ggerganov
Copy link
Member

Yes. If you (or anyone else) is interested in implementing this, I will invite you to the HF org so you can set the repo. LMK

@CISC
Copy link
Collaborator Author

CISC commented May 25, 2025

Sure, I can do that, I sent a request.

@CISC
Copy link
Collaborator Author

CISC commented May 26, 2025

@ngxson I need the common_download_file_multiple function to be able to download and cache all the files in a repo, any thoughts around exposing this (and common_download_file_single I guess)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants