Describe the bug
When training the AutoencoderKL model with the train_autoencoderkl.py example script, the training loss does not converge on the ImageNet dataset.
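A minimal sanity check (not part of the original report; only a sketch assuming diffusers 0.33.0.dev0 and torchvision are installed): load the pretrained stabilityai/sd-vae-ft-mse VAE that the script fine-tunes from and measure its reconstruction MSE on one of the validation images from the reproduction command below, to get a baseline against which the non-converging training loss can be compared.

import torch
import torch.nn.functional as F
from torchvision import transforms
from diffusers import AutoencoderKL
from diffusers.utils import load_image

# Load the same pretrained VAE that train_autoencoderkl.py starts from.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

# Preprocess to the training resolution and the [-1, 1] range the VAE expects.
preprocess = transforms.Compose([
    transforms.Resize(128),
    transforms.CenterCrop(128),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

image = load_image("./val/ILSVRC2012_val_00000293.JPEG")  # path from the command below
x = preprocess(image).unsqueeze(0)

with torch.no_grad():
    posterior = vae.encode(x).latent_dist        # diagonal Gaussian over latents
    recon = vae.decode(posterior.sample()).sample
    print(f"baseline reconstruction MSE: {F.mse_loss(recon, x).item():.4f}")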
Reproduction
Script
accelerate launch --multi_gpu --num_processes=2 --gpu_ids=0,1 \
train_autoencoderkl.py \
--pretrained_model_name_or_path stabilityai/sd-vae-ft-mse \
--max_train_steps 850000 \
--validation_steps 100 \
--checkpointing_steps 1000 \
--gradient_accumulation_steps 2 \
--learning_rate 4.5e-6 \
--lr_scheduler cosine \
--report_to wandb \
--mixed_precision bf16 \
--train_batch_size 8 \
--dataloader_num_workers 16 \
--output_dir autoencoderkl-model/imagenet \
--train_data_dir /datasets/image/imagenet-test/train \
--validation_image ./val/ILSVRC2012_val_00000293.JPEG ./val/ILSVRC2012_val_00002138.JPEG \
--resolution 128
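For reference (an inference from the flags above, not a statement from the original report): with --train_batch_size 8 per process, 2 processes, and --gradient_accumulation_steps 2, each optimizer step sees 8 × 2 × 2 = 32 images at 128×128 resolution, with a base learning rate of 4.5e-6 decayed by a cosine schedule in bf16 mixed precision.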
Logs
System Info
- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.17
- Running on Google Colab?: No
- Python version: 3.8.20
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.1
- Transformers version: 4.46.3
- Accelerate version: 1.0.1
- PEFT version: not installed
- Bitsandbytes version: 0.45.4
- Safetensors version: 0.5.3
- xFormers version: 0.0.28.post1
- Accelerator: NVIDIA GeForce RTX 3090, 24576 MiB
  NVIDIA GeForce RTX 3090, 24576 MiB
- Using GPU in script?: Yes (2x NVIDIA GeForce RTX 3090, per the launch command above)
- Using distributed or parallel set-up in script?: Yes (accelerate launch --multi_gpu with 2 processes)