Improve Tokenizer New Type Onboarding #1536

Open
@zhenyan-zhang-meta

🚀 The feature, motivation and pitch


As a follow-up to #1518, which added an enum for tokenizer types to simplify TokenizerArgs.__post_init__, we should go further and make it simpler to onboard a new tokenizer type:

Tasks


  • Move TokenizerType to a centralized place
  • Check all getters of tokenizer types
  • Add documentation for onboarding future tokenizer types.
    • The documentation may need to point people to the model validation logic, which has to be updated for each new type:
      def validate_model(
          self,
          model: Optional[Model],
          model_description: str = "model",
      ) -> None:
          if model is None:
              return

          if self.tokenizer_type == TokenizerType.NONE:
              raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

          is_tiktoken = self.is_tiktoken()
          is_sentencepiece = self.is_sentencepiece()
          is_hf_tokenizer = self.is_hf_tokenizer()

          use_tiktoken = model.config.use_tiktoken
          use_hf_tokenizer = model.config.use_hf_tokenizer
          use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

          if (
              (is_tiktoken and not use_tiktoken) or
              (is_hf_tokenizer and not use_hf_tokenizer) or
              (is_sentencepiece and not use_sentencepiece)
          ):
              raise RuntimeError(
                  "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                      tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                      tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                      model_description,
                  )
              )

          return
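
One possible direction for the tasks above, as a rough sketch rather than the actual torchchat code: if the validation compared a single expected TokenizerType against the provided one, a new tokenizer type would only need a new enum member and one extra branch in the mapping. Everything here is an assumption for illustration: the enum member names other than NONE, the ModelConfig stand-in, and the helpers expected_tokenizer_type and validate_tokenizer are hypothetical. Note that it is slightly stricter than the snippet above when both config flags are set, since it then expects tiktoken only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TokenizerType(Enum):
    # Hypothetical centralized enum; only NONE appears in the issue text.
    NONE = 0
    TIKTOKEN = 1
    SENTENCEPIECE = 2
    HF_TOKENIZER = 3


@dataclass
class ModelConfig:
    # Stand-in for model.config, holding only the two flags read above.
    use_tiktoken: bool = False
    use_hf_tokenizer: bool = False


def expected_tokenizer_type(config: ModelConfig) -> TokenizerType:
    """Map the model's config flags to the tokenizer type it expects."""
    if config.use_tiktoken:
        return TokenizerType.TIKTOKEN
    if config.use_hf_tokenizer:
        return TokenizerType.HF_TOKENIZER
    # SentencePiece is the fallback when neither flag is set.
    return TokenizerType.SENTENCEPIECE


def validate_tokenizer(
    provided: TokenizerType,
    config: Optional[ModelConfig],
    tokenizer_path: str,
    model_description: str = "model",
) -> None:
    """Raise RuntimeError if the provided tokenizer does not match the model."""
    if config is None:
        return
    if provided == TokenizerType.NONE:
        raise RuntimeError(f"no tokenizer was found at {tokenizer_path}")
    expected = expected_tokenizer_type(config)
    if provided != expected:
        raise RuntimeError(
            f"model-specified tokenizer ({expected.name}) does not match "
            f"provided tokenizer ({provided.name}) for {model_description}"
        )
```

With this shape, the onboarding documentation could point to a single function (here expected_tokenizer_type) instead of a chain of boolean getters.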

To test, run a model with each tokenizer type:

  • python torchchat.py generate llama2
  • python torchchat.py generate llama3
  • python torchchat.py generate granite-code
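
The three commands above could also be run in one go. The helper below is only a convenience sketch: build_commands and run_all are hypothetical names, and it assumes the script is started from the repository root where torchchat.py lives.

```python
import subprocess
import sys

# Model aliases copied from the commands listed above.
MODELS = ("llama2", "llama3", "granite-code")


def build_commands(models=MODELS):
    """Return one `python torchchat.py generate <model>` command per model."""
    return [[sys.executable, "torchchat.py", "generate", m] for m in models]


def run_all(models=MODELS):
    """Run each command in turn; check=True stops at the first failure."""
    for cmd in build_commands(models):
        subprocess.run(cmd, check=True)
```

Calling run_all() executes the commands sequentially and raises subprocess.CalledProcessError if any of them exits with a non-zero status.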

cc @Jack-Khuu @byjlw

Labels

  • actionable: Items in the backlog waiting for an appropriate impl/fix
  • good first issue: Good for newcomers
  • triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module
