Describe the bug
I think there are still some problems with the learning rate scheduler. Setting `--max_train_steps` resolves it, as discussed in #3954, but not completely.

For example, take this snippet from https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py#L816-L833, pasted here:
```python
# Scheduler and math around the number of training steps.
overrode_max_train_steps = False
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    overrode_max_train_steps = True

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)

# Prepare everything with our `accelerator`.
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)
```
When setting `--num_train_epochs` instead of `--max_train_steps`, the calculation of `num_update_steps_per_epoch` is incorrect because `train_dataloader` has not yet been wrapped by `accelerator.prepare`. Consequently, `args.max_train_steps` is roughly `num_processes` times the actual value, and this discrepancy leads to unintended values being passed into `get_scheduler`.

In fact, the logic here is quite confusing. It seems like a refactoring might be necessary.
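A rough sketch of one possible reordering (untested, and assuming that preparing the dataloader in a separate `accelerator.prepare` call behaves the same as preparing everything at once) would be to shard the dataloader first, so that `args.max_train_steps` is derived from the per-process length:

```python
# Sketch only, not a tested patch: prepare (and therefore shard) the dataloader
# before num_update_steps_per_epoch is derived from its length.
train_dataloader = accelerator.prepare(train_dataloader)

num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)

unet, optimizer, lr_scheduler = accelerator.prepare(unet, optimizer, lr_scheduler)
```

The point is only the ordering: whatever derives `args.max_train_steps` from `len(train_dataloader)` should see the sharded length that the training loop will actually iterate over.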
Reproduction
```diff
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  ...
- --max_train_steps=15000 \
+ --num_train_epochs=100 \
  ...
```
Logs
No response
System Info
- `diffusers` version: 0.27.2
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.9.17
- PyTorch version (GPU?): 2.0.1 (False)
- Huggingface_hub version: 0.20.3
- Transformers version: 4.30.0
- Accelerate version: 0.21.0
- xFormers version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no