Description
Describe the bug
When training on multiple GPUs with a large dataset or a high resolution (so inference takes longer), the run fails when saving the images. This is caused by the NCCL timeout (1800 s, i.e. 30 min). It can be fixed by raising the timeout with the following code:
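from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the process-group timeout from 1800 s to 7200 s (2 hours).
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.logger,
    project_config=accelerator_project_config,
    kwargs_handlers=[kwargs],
)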
It may be fine to keep the default timeout, but it would help to add this code (for example as an option) for people who need a longer one.
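One possible way to keep the default while making it adjustable is a command-line option. The sketch below is only an illustration: the `--nccl_timeout` flag and the helper names `timeout_from_args` and `build_kwargs_handlers` are hypothetical and do not exist in `train_unconditional.py`. Only `InitProcessGroupKwargs(timeout=...)` is taken from the snippet above.

```python
import argparse
from datetime import timedelta

# Default matches the 30-minute NCCL timeout described above.
DEFAULT_NCCL_TIMEOUT_S = 1800


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical flag: not part of train_unconditional.py today.
    parser.add_argument(
        "--nccl_timeout",
        type=int,
        default=DEFAULT_NCCL_TIMEOUT_S,
        help="Process-group timeout in seconds.",
    )
    return parser.parse_args(argv)


def timeout_from_args(args):
    # Convert the integer number of seconds into a timedelta.
    return timedelta(seconds=args.nccl_timeout)


def build_kwargs_handlers(args):
    # Imported lazily so argument parsing works without accelerate installed.
    from accelerate.utils import InitProcessGroupKwargs

    return [InitProcessGroupKwargs(timeout=timeout_from_args(args))]
```

With something like this in place, the script could pass `kwargs_handlers=build_kwargs_handlers(args)` to `Accelerator(...)`, and users who hit the timeout would launch with `--nccl_timeout=7200` while everyone else keeps the default.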
Reproduction
accelerate launch --mixed_precision="fp16" --multi_gpu train_unconditional.py --dataset_name="" --resolution=512 --output_dir="" --train_batch_size=1 --num_epochs=100 --gradient_accumulation_steps=1 --learning_rate=1e-5 --lr_warmup_steps=500 --mixed_precision="fp16"
Logs
No response
System Info
- diffusers version: 0.16.1
- Platform: Linux-4.4.0-210-generic-x86_64-with-glibc2.27
- Python version: 3.10.11
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.14.1
- Transformers version: 4.31.0.dev0
- Accelerate version: 0.19.0
- xFormers version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response