[training] fix training resuming problem for fp16 (SD LoRA DreamBooth) #6554

a-r-r-o-w · 2024-01-12T12:56:49Z

What does this PR do?

Part of #6552.

I have yet to test this on a training run. I think Sayak mentioned that he will open a follow-up PR adding a utility function to remove the duplicated code.

First run:

CUDA_VISIBLE_DEVICES=0 accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
  --instance_data_dir="dog" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="lora-dog-sd" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=4 --checkpointing_steps=2 --checkpoints_total_limit=2 \
  --use_8bit_adam \
  --seed="42"

Resume training:

CUDA_VISIBLE_DEVICES=0 accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
  --instance_data_dir="dog" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="lora-dog-sd" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=4 --checkpointing_steps=2 --checkpoints_total_limit=2 \
  --resume_from_checkpoint="latest" \
  --use_8bit_adam \
  --seed="42"
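For readers unfamiliar with `--resume_from_checkpoint="latest"`, here is a minimal sketch of how the most recent checkpoint can be resolved. It assumes checkpoints are saved as `checkpoint-<step>` directories under `--output_dir`. The function name `find_latest_checkpoint` is hypothetical and used only for illustration; it is not code from this PR.

```python
import os
import tempfile
from typing import Optional


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the name of the highest-step `checkpoint-<step>` directory, or None.

    Assumption: checkpoints live directly under `output_dir` and are named
    `checkpoint-<step>`, where <step> is the global step as a decimal integer.
    """
    candidates = []
    for name in os.listdir(output_dir):
        prefix, _, step = name.partition("-")
        is_dir = os.path.isdir(os.path.join(output_dir, name))
        if prefix == "checkpoint" and step.isdigit() and is_dir:
            # Compare steps numerically so that step 10 ranks above step 4.
            candidates.append((int(step), name))
    if not candidates:
        return None
    return max(candidates)[1]


if __name__ == "__main__":
    # Demo in a throwaway directory that mimics an output_dir with checkpoints.
    with tempfile.TemporaryDirectory() as output_dir:
        for step in (2, 4):
            os.makedirs(os.path.join(output_dir, f"checkpoint-{step}"))
        print(find_latest_checkpoint(output_dir))
```

With `--max_train_steps=4` and `--checkpointing_steps=2` as in the commands above, the output directory would be expected to hold `checkpoint-2` and `checkpoint-4`, so a lookup of this kind would pick `checkpoint-4`.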

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@sayakpaul

sayakpaul · 2024-01-12T13:02:59Z

Super fast! I have updated the description of #6552. It should be clear now.

HuggingFaceDocBuilderDev · 2024-01-12T13:03:44Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sayakpaul · 2024-01-15T11:44:27Z

@a-r-r-o-w let me know if I can test it.

a-r-r-o-w · 2024-01-15T19:20:54Z

Hi Sayak, I believe this is ready for testing, and it seems to be working well. Here are my logs:

sayakpaul · 2024-01-16T01:57:01Z

Just tested. Seems to be working well! Thank you!

fix training resume

cc4abfd

a-r-r-o-w added 2 commits January 16, 2024 00:31

Merge remote-tracking branch 'origin/main' into fix-fp16-dreambooth-sd

6970c71

update

9bb5760

update

9f76b50

sayakpaul merged commit c11de13 into huggingface:main Jan 16, 2024

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024

[training] fix training resuming problem for fp16 (SD LoRA DreamBooth) (huggingface#6554)

* fix training resume
* update
* update

2936984
