[Community] Help us fix the LR schedulers when num_train_epochs is passed in a distributed training env #8384

@sayakpaul

Description

Context

Refer to #8312 for the full context. The changes introduced in that PR should be propagated to the following scripts, too (a minimal sketch of the pattern follows this list):

  • advanced_diffusion_training

    • train_dreambooth_lora_sd15_advanced.py
    • train_dreambooth_lora_sdxl_advanced.py
  • consistency_distillation

    • train_lcm_distill_lora_sdxl.py
  • controlnet

    • train_controlnet.py
    • train_controlnet_sdxl.py
  • custom_diffusion

    • train_custom_diffusion.py
  • dreambooth

    • train_dreambooth.py
    • train_dreambooth_lora.py
    • train_dreambooth_lora_sdxl.py
  • instruct_pix2pix

    • train_instruct_pix2pix.py
    • train_instruct_pix2pix_sdxl.py
  • kandinsky2_2/text_to_image

    • train_text_to_image_decoder.py
    • train_text_to_image_prior.py
    • train_text_to_image_lora_decoder.py
    • train_text_to_image_lora_prior.py
  • t2i_adapter

    • train_t2i_adapter_sdxl.py
  • text_to_image

    • train_text_to_image.py
    • train_text_to_image_sdxl.py
    • train_text_to_image_lora.py
    • train_text_to_image_lora_sdxl.py
  • textual_inversion

    • textual_inversion.py
    • textual_inversion_sdxl.py
  • unconditional_image_generation

    • train_unconditional.py
  • wuerstchen

    • text_to_image/train_text_to_image_prior.py
    • text_to_image/train_text_to_image_lora_prior.py
  • research_projects (low-priority)

    • consistency_training/train_cm_ct_unconditional.py
    • diffusion_dpo/train_diffusion_dpo.py
    • diffusion_dpo/train_diffusion_dpo_sdxl.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora.py
    • dreambooth_inpaint/train_dreambooth_inpaint.py
    • dreambooth_inpaint/train_dreambooth_inpaint_lora.py
    • instructpix2pix_lora/train_instruct_pix2pix_lora.py
    • intel_opts/textual_inversion/textual_inversion_bf16.py
    • intel_opts/textual_inversion_dfq/textual_inversion.py
    • lora/train_text_to_image_lora.py
    • multi_subject_dreambooth/train_multi_subject_dreambooth.py
    • multi_token_textual_inversion/textual_inversion.py
    • onnxruntime/text_to_image/train_text_to_image.py
    • onnxruntime/textual_inversion/textual_inversion.py
    • onnxruntime/unconditional_image_generation/train_unconditional.py
    • realfill/train_realfill.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora_sdxl.py
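
For reference, here is a minimal, self-contained sketch of the pattern #8312 introduces. The toy model, optimizer, dataloader, and the `Args` namespace below are stand-ins for what the real scripts build with argparse; treat this as a sketch of the idea, not a drop-in diff.

```python
import math

import torch
from accelerate import Accelerator
from diffusers.optimization import get_scheduler

accelerator = Accelerator()

# Toy stand-ins for pieces every training script already has.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_dataloader = torch.utils.data.DataLoader(torch.randn(64, 4), batch_size=8)

class Args:  # hypothetical; the real scripts get these from argparse
    lr_scheduler = "constant"
    lr_warmup_steps = 0
    gradient_accumulation_steps = 1
    max_train_steps = None  # i.e. the user passed --num_train_epochs instead
    num_train_epochs = 10

args = Args()

# Build the LR scheduler *before* accelerator.prepare(), but size it for the
# dataloader length each process will see *after* sharding.
if args.max_train_steps is None:
    len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)
    num_update_steps_per_epoch = math.ceil(
        len_train_dataloader_after_sharding / args.gradient_accumulation_steps
    )
    num_training_steps_for_scheduler = (
        args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
    )
else:
    num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=num_training_steps_for_scheduler,
)

model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

# The dataloader is sharded after prepare(), so recompute max_train_steps and
# warn if it disagrees with what the scheduler was sized for.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    if num_training_steps_for_scheduler != args.max_train_steps * accelerator.num_processes:
        print(
            "The dataloader length after accelerator.prepare() does not match the "
            "length assumed when the LR scheduler was created; the LR schedule may "
            "not behave as intended."
        )
```

The gist: when only --num_train_epochs is passed, sizing the scheduler from the unsharded dataloader length overcounts the per-process steps, which makes the schedule too long in multi-GPU runs.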

The following scripts do not have the argument --num_train_epochs:

  • amused
    • train_amused.py
  • research_projects
    • multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py

So, they don't need to be updated.

Then we have the following scripts that don't use accelerator to prepare the datasets. In these, distributed dataset sharding is done by WebDataset, not accelerator, so we can skip them for now (see the sketch after this list):

  • consistency_distillation
    • train_lcm_distill_sd_wds.py
    • train_lcm_distill_sdxl_wds.py
    • train_lcm_distill_lora_sd_wds.py
    • train_lcm_distill_lora_sdxl_wds.py
  • research_projects
    • controlnet/train_controlnet_webdataset.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora_wds.py
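
For illustration, a minimal sketch of the WebDataset-side sharding these scripts rely on (the shard pattern is a placeholder, and the exact pipeline stages vary per script):

```python
import webdataset as wds

# Each rank keeps only its own subset of shards, so the dataloader handed to
# accelerator.prepare() is already per-process; the len(train_dataloader)-based
# recalculation above doesn't apply as-is.
dataset = wds.DataPipeline(
    wds.SimpleShardList("data-{000000..000999}.tar"),  # placeholder shard URLs
    wds.split_by_node,    # shard across distributed ranks
    wds.split_by_worker,  # shard across dataloader workers
    wds.tarfile_to_samples(),
)
```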

Steps to follow when opening PRs

  • Target one AND only one training script in a single PR.
  • When you open a PR, please mention this issue.
  • Mention @sayakpaul and @geniuspatrick for a review.
  • Accompany your PR with a minimal training command using the num_train_epochs CLI arg (see the example after this list).
  • Enjoy!
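
For example, for text_to_image, the command could look along these lines (model, dataset, and hyperparameters are placeholders): `accelerate launch --num_processes=2 examples/text_to_image/train_text_to_image.py --pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1 --dataset_name=lambdalabs/naruto-blip-captions --resolution=512 --train_batch_size=1 --num_train_epochs=2 --output_dir=sd-naruto-model`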
