
Wrong learning rate scheduler training step count in multi-GPU examples when setting --num_train_epochs #8236

Closed

Description

@geniuspatrick

Describe the bug

I think there are still some problems with the learning rate scheduler step count. Setting --max_train_steps works around this, as discussed in #3954, but the issue is not completely resolved.

For example, consider the snippet at https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py#L816-L833, pasted here:

# Scheduler and math around the number of training steps.
overrode_max_train_steps = False
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    overrode_max_train_steps = True

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)

# Prepare everything with our `accelerator`.
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)

When --num_train_epochs is set instead of --max_train_steps, the calculation of num_update_steps_per_epoch is incorrect because train_dataloader has not yet been wrapped by accelerator.prepare, so len(train_dataloader) still counts batches over the full, unsharded dataset. Consequently, args.max_train_steps ends up roughly num_processes times the intended value, and that inflated value is then passed into get_scheduler (where it is multiplied by accelerator.num_processes once more). For example, with 2 processes, 1000 batches, and gradient_accumulation_steps=1, each process actually performs 500 update steps per epoch, but the code computes 1000.

In fact, the ordering logic here is quite confusing; a refactor may be necessary. A sketch of one possible reordering follows.
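
This is a minimal sketch of one such reordering, assuming the surrounding script's names (args, accelerator, unet, optimizer, train_dataloader, get_scheduler); it illustrates the idea and is not a tested patch. Preparing the dataloader first means len(train_dataloader) already reflects the per-process shard by the time max_train_steps is derived:

import math

# Prepare the model, optimizer, and dataloader first, so that
# len(train_dataloader) reflects the per-process shard under multi-GPU.
unet, optimizer, train_dataloader = accelerator.prepare(unet, optimizer, train_dataloader)

# The epoch math is now based on the sharded dataloader length.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch

# The num_processes multiplier is still needed: the scheduler wrapped by
# accelerate advances num_processes steps per optimizer step.
lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)
lr_scheduler = accelerator.prepare(lr_scheduler)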

Reproduction

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  ...
-  --max_train_steps=15000 \
+  --num_train_epochs=100 \
  ...
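
As a quick sanity check under a multi-GPU launch, a hypothetical probe (not part of the script) can show the dataloader length shrinking once it is prepared:

accelerator.print("before prepare:", len(train_dataloader))  # batches over the full dataset
train_dataloader = accelerator.prepare(train_dataloader)
accelerator.print("after prepare:", len(train_dataloader))   # roughly 1/num_processes of the above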

Logs

No response

System Info

  • diffusers version: 0.27.2
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.9.17
  • PyTorch version (GPU?): 2.0.1 (False)
  • Huggingface_hub version: 0.20.3
  • Transformers version: 4.30.0
  • Accelerate version: 0.21.0
  • xFormers version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@sayakpaul @yiyixuxu @eliphatfs
