[Community] Help us fix the LR schedulers when num_train_epochs is passed in a distributed training env #8384

@sayakpaul

Description

Context

Refer to #8312 for the full context. The changes introduced in that PR should be propagated to the following scripts, too (a minimal sketch of the pattern follows this list):

  • advanced_diffusion_training

    • train_dreambooth_lora_sd15_advanced.py
    • train_dreambooth_lora_sdxl_advanced.py
  • consistency_distillation

    • train_lcm_distill_lora_sdxl.py
  • controlnet

    • train_controlnet.py
    • train_controlnet_sdxl.py
  • custom_diffusion

    • train_custom_diffusion.py
  • dreambooth

    • train_dreambooth.py
    • train_dreambooth_lora.py
    • train_dreambooth_lora_sdxl.py
  • instruct_pix2pix

    • train_instruct_pix2pix.py
    • train_instruct_pix2pix_sdxl.py
  • kandinsky2_2/text_to_image

    • train_text_to_image_decoder.py
    • train_text_to_image_prior.py
    • train_text_to_image_lora_decoder.py
    • train_text_to_image_lora_prior.py
  • t2i_adapter

    • train_t2i_adapter_sdxl.py
  • text_to_image

    • train_text_to_image.py
    • train_text_to_image_sdxl.py
    • train_text_to_image_lora.py
    • train_text_to_image_lora_sdxl.py
  • textual_inversion

    • textual_inversion.py
    • textual_inversion_sdxl.py
  • unconditional_image_generation

    • train_unconditional.py
  • wuerstchen

    • text_to_image/train_text_to_image_prior.py
    • text_to_image/train_text_to_image_lora_prior.py
  • research_projects (low-priority)

    • consistency_training/train_cm_ct_unconditional.py
    • diffusion_dpo/train_diffusion_dpo.py
    • diffusion_dpo/train_diffusion_dpo_sdxl.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora.py
    • dreambooth_inpaint/train_dreambooth_inpaint.py
    • dreambooth_inpaint/train_dreambooth_inpaint_lora.py
    • instructpix2pix_lora/train_instruct_pix2pix_lora.py
    • intel_opts/textual_inversion/textual_inversion_bf16.py
    • intel_opts/textual_inversion_dfq/textual_inversion.py
    • lora/train_text_to_image_lora.py
    • multi_subject_dreambooth/train_multi_subject_dreambooth.py
    • multi_token_textual_inversion/textual_inversion.py
    • onnxruntime/text_to_image/train_text_to_image.py
    • onnxruntime/textual_inversion/textual_inversion.py
    • onnxruntime/unconditional_image_generation/train_unconditional.py
    • realfill/train_realfill.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora_sdxl.py
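
For reference, here is a minimal, self-contained sketch of the pattern #8312 introduces. The toy model, optimizer, dataloader, and the `Args` namespace below are stand-ins for what the real scripts build with argparse; treat this as a sketch of the idea, not a drop-in diff.

```python
import math

import torch
from accelerate import Accelerator
from diffusers.optimization import get_scheduler

accelerator = Accelerator()

# Toy stand-ins for pieces every training script already has.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_dataloader = torch.utils.data.DataLoader(torch.randn(64, 4), batch_size=8)

class Args:  # hypothetical; the real scripts get these from argparse
    lr_scheduler = "constant"
    lr_warmup_steps = 0
    gradient_accumulation_steps = 1
    max_train_steps = None  # i.e. the user passed --num_train_epochs instead
    num_train_epochs = 10

args = Args()

# Build the LR scheduler *before* accelerator.prepare(), but size it for the
# dataloader length each process will see *after* sharding.
if args.max_train_steps is None:
    len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)
    num_update_steps_per_epoch = math.ceil(
        len_train_dataloader_after_sharding / args.gradient_accumulation_steps
    )
    num_training_steps_for_scheduler = (
        args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
    )
else:
    num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=num_training_steps_for_scheduler,
)

model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

# The dataloader is sharded after prepare(), so recompute max_train_steps and
# warn if it disagrees with what the scheduler was sized for.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    if num_training_steps_for_scheduler != args.max_train_steps * accelerator.num_processes:
        print(
            "The dataloader length after accelerator.prepare() does not match the "
            "length assumed when the LR scheduler was created; the LR schedule may "
            "not behave as intended."
        )
```

The gist: when only --num_train_epochs is passed, sizing the scheduler from the unsharded dataloader length overcounts the per-process steps, which makes the schedule too long in multi-GPU runs.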

The following scripts do not have the argument --num_train_epochs:

  • amused
    • train_amused.py
  • research_projects
    • multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py

So, they don't need to be updated.

Then we have the following scripts that don't use accelerator to prepare the datasets. In these, distributed dataset sharding is done by WebDataset, not accelerator, so we can skip them for now (see the sketch after this list):

  • consistency_distillation
    • train_lcm_distill_sd_wds.py
    • train_lcm_distill_sdxl_wds.py
    • train_lcm_distill_lora_sd_wds.py
    • train_lcm_distill_lora_sdxl_wds.py
  • research_projects
    • controlnet/train_controlnet_webdataset.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora_wds.py
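
For illustration, a minimal sketch of the WebDataset-side sharding these scripts rely on (the shard pattern is a placeholder, and the exact pipeline stages vary per script):

```python
import webdataset as wds

# Each rank keeps only its own subset of shards, so the dataloader handed to
# accelerator.prepare() is already per-process; the len(train_dataloader)-based
# recalculation above doesn't apply as-is.
dataset = wds.DataPipeline(
    wds.SimpleShardList("data-{000000..000999}.tar"),  # placeholder shard URLs
    wds.split_by_node,    # shard across distributed ranks
    wds.split_by_worker,  # shard across dataloader workers
    wds.tarfile_to_samples(),
)
```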

Steps to follow when opening PRs

  • Target one AND only one training script in a single PR.
  • When you open a PR, please mention this issue.
  • Mention @sayakpaul and @geniuspatrick for a review.
  • Accompany your PR with a minimal training command using the num_train_epochs CLI arg (see the example after this list).
  • Enjoy!
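
For example, for text_to_image, the command could look along these lines (model, dataset, and hyperparameters are placeholders): `accelerate launch --num_processes=2 examples/text_to_image/train_text_to_image.py --pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1 --dataset_name=lambdalabs/naruto-blip-captions --resolution=512 --train_batch_size=1 --num_train_epochs=2 --output_dir=sd-naruto-model`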
