Describe the bug
For cosine learning rate schedules, the learning rate should approach zero by the end of training. However, because of accelerate's prepare logic, the learning rate scheduler is stepped N times for each optimization step when training on N GPUs. We therefore need to multiply the scheduler's step counts by num_processes when constructing it (see the sketch after this paragraph).
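A minimal sketch of the proposed adjustment, assuming the examples' `get_scheduler` call from `diffusers.optimization` (the model, optimizer, and step counts below are placeholders, not the actual example-script values):

```python
import torch
from accelerate import Accelerator
from diffusers.optimization import get_scheduler

accelerator = Accelerator()

# Placeholder model/optimizer so the sketch is self-contained.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

max_train_steps = 1000  # placeholder for args.max_train_steps
lr_warmup_steps = 100   # placeholder for args.lr_warmup_steps

# The prepared scheduler is stepped once per process for every optimization
# step, so scale both horizons by num_processes so that the cosine schedule
# decays to zero exactly when training ends.
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=lr_warmup_steps * accelerator.num_processes,
    num_training_steps=max_train_steps * accelerator.num_processes,
)
```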
Sample learning rate curve from running the dreambooth example on 4 GPUs:
Reproduction
Run the example training scripts with a cosine learning rate schedule on multiple GPUs; a rough single-process simulation of the stepping behavior is sketched below.
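A rough single-process simulation of the issue, assuming that accelerate's prepared scheduler receives one step() call per process for each optimization step (num_processes and max_train_steps below are placeholder values, not taken from the example scripts):

```python
import torch
from diffusers.optimization import get_scheduler

num_processes = 4       # hypothetical GPU count
max_train_steps = 1000  # hypothetical training length

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=max_train_steps,  # not scaled by num_processes
)

for _ in range(max_train_steps):
    optimizer.step()
    for _ in range(num_processes):  # mimic one scheduler.step() per rank
        lr_scheduler.step()

# The schedule runs through its horizon num_processes times too fast,
# so the learning rate does not end near zero.
print(lr_scheduler.get_last_lr()[0])
```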
Logs
No response
System Info
- `diffusers` version: 0.17.1
- Platform: Linux-5.4.0-150-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.13
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.15.1
- Transformers version: 4.30.2
- Accelerate version: 0.20.3
- xFormers version: 0.0.20+1dc3d7a.d20230628
- Using GPU in script?: Y
- Using distributed or parallel set-up in script?: DDP