
Wrong learning rate scheduler training step count for examples with multi-gpu. #3954

Closed
@eliphatfs

Description


Describe the bug

With a cosine learning rate schedule, the learning rate should approach zero by the time training ends. However, because of Accelerate's prepare logic, the learning rate scheduler is stepped N times for each optimization step when training on N GPUs, so the schedule finishes N times too early. To compensate, the step counts need to be multiplied by num_processes when constructing the scheduler.
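A minimal sketch of the fix (the model, optimizer, and step counts below are hypothetical stand-ins for the example scripts' real setup): scaling both step counts passed to get_scheduler by accelerator.num_processes keeps the cosine curve aligned with the number of times the wrapped scheduler is actually stepped.

```python
import torch
from accelerate import Accelerator
from diffusers.optimization import get_scheduler

# Hypothetical stand-ins for the example script's model and optimizer.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accelerator = Accelerator()
max_train_steps = 1000  # intended total optimization steps (hypothetical)
lr_warmup_steps = 100   # hypothetical warmup

# accelerator.prepare() wraps the scheduler so that (without split_batches)
# it is stepped once per process for every optimization step. Multiplying
# the step counts by num_processes makes the wrapped scheduler traverse
# the cosine curve exactly once over the intended training run.
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=lr_warmup_steps * accelerator.num_processes,
    num_training_steps=max_train_steps * accelerator.num_processes,
)

model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)
```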

Sample learning rate curve from running the dreambooth example on 4 GPUs:

[image: learning rate curve]

Reproduction

Run the example scripts (e.g. examples/dreambooth/train_dreambooth.py) with a cosine learning rate schedule, e.g. --lr_scheduler cosine, under accelerate launch on multiple GPUs.

Logs

No response

System Info

  • diffusers version: 0.17.1
  • Platform: Linux-5.4.0-150-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.13
  • PyTorch version (GPU?): 1.12.1 (True)
  • Huggingface_hub version: 0.15.1
  • Transformers version: 4.30.2
  • Accelerate version: 0.20.3
  • xFormers version: 0.0.20+1dc3d7a.d20230628
  • Using GPU in script?: Y
  • Using distributed or parallel set-up in script?: DDP

Who can help?

@williamberman, @sayakpaul, @yiyixuxu
