[SDXL] Partial diffusion support for Text2Img and Img2Img Pipelines #4015

Merged: 20 commits, Jul 12, 2023

Conversation

@bghira (Contributor) commented Jul 10, 2023

What does this PR do?

Fixes #4003

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented Jul 10, 2023

The documentation is not available anymore as the PR was closed or merged.

@sayakpaul (Member) left a comment

This looks pretty good to me. For me, the following things are TODO:

More than happy to help if things are unclear.

@patrickvonplaten WDYT?

@patrickvonplaten (Contributor) commented Jul 10, 2023

Hey @bghira,

Thanks for the PR! Could you maybe post a couple of examples with diffusers code that show how this approach is better? Or I guess the workflow is as stated here: #4003 (comment)?

Copy-pasting it here:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent",
              num_inference_steps=20, max_inference_steps=10).images

# If you have low VRAM, free the base pipeline before loading the refiner.
# del pipe
# import gc
# gc.collect()
# torch.cuda.empty_cache()
# torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image,
               num_inference_steps=20, first_inference_step=10, add_noise=False).images

@patrickvonplaten (Contributor) commented Jul 10, 2023

Hey @bghira,

Very cool PR, I think I've understood the full workflow now. So essentially we use the SD-XL base model to denoise the image at high-noise timesteps, and we use the refiner to denoise it at low-noise timesteps.

In that sense your approach here is very similar to the idea of eDiffi - it's very cool that you made it work for SD-XL!

From the eDiffi paper:

Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored, and the task changes to producing outputs of high visual fidelity.

=> So in a sense "base" is used here mostly to generate text-aligned content and the refiner is used for high-visual-fidelity results.

Think we need to add tests and docs here as well :-)
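
To make the idea concrete, here is a minimal sketch of such a split (illustrative only, with assumed values; this is not the PR's actual code). With a 20-step schedule and a 50/50 split, the base model handles the first, high-noise half of the timesteps and the refiner the second, low-noise half:

import torch

num_inference_steps = 20
num_train_timesteps = 1000  # length of the training noise schedule

# Descending timesteps, as diffusers schedulers produce them.
step = num_train_timesteps // num_inference_steps
timesteps = torch.arange(num_train_timesteps - 1, -1, -step)

split = num_inference_steps // 2
base_timesteps = timesteps[:split]     # high noise: text-aligned content
refiner_timesteps = timesteps[split:]  # low noise: visual fidelity
print(base_timesteps.tolist())     # [999, 949, ..., 549]
print(refiner_timesteps.tolist())  # [499, 449, ..., 49]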

@@ -619,6 +638,8 @@ def __call__(
] = None,
strength: float = 0.3,
num_inference_steps: int = 50,
max_inference_steps: Optional[int] = None,
@patrickvonplaten (Contributor) commented Jul 10, 2023

Suggested change
max_inference_steps: Optional[int] = None,
denoising_end: Optional[float] = None,

Same as: https://github.com/huggingface/diffusers/pull/4015/files#r1257982804 => can we switch to % here maybe?

@bghira (Contributor Author) commented

If you use percentages, the precision lost when the calculation is converted to an integer means you might misalign timesteps, which results in artifacting.

A member commented

Fair point!

@bghira (Contributor Author) commented

I have an idea: I will implement a helper method that returns the pair of step values from two inputs.

A contributor commented

Hmm, can you explain how you might accidentally misalign?

A contributor commented

Actually I think this can't really happen if we always round, e.g.:

max_inference_steps = int(round(denoising_end * num_inference_steps))
first_inference_steps = int(round(denoising_start * num_inference_steps))

denoising_end should always be equal to 1 - denoising_start, and in this case we will always have:

num_inference_steps = max_inference_steps + first_inference_steps

no?
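
For concreteness, a quick numeric check of this rounding scheme on the 27-step example discussed below (illustrative values, not code from the PR):

num_inference_steps = 27
denoising_end = 0.73              # base model's share of the steps
denoising_start = 1 - denoising_end

max_inference_steps = int(round(denoising_end * num_inference_steps))      # round(19.71) -> 20
first_inference_steps = int(round(denoising_start * num_inference_steps))  # round(7.29)  -> 7

# The two halves partition the schedule exactly in this case.
assert max_inference_steps + first_inference_steps == num_inference_steps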

@bghira (Contributor Author) commented

It happens when you have values for strength that leave fractional remainders, which the int() cast discards. I have tested it.

In this test, num_inference_steps is set to 27 and strength is set to 0.73. The calculation says the base model should run for about 19.71 steps, but since we can't run a fractional step, int() truncates this to 19, while the value for begin_inference_step is 7.29, which truncates to 7.

This results in a misalignment of one step, which is, apparently, enough to disrupt the output altogether.

We could use round(), but instead I have opted for what I consider a more robust approach.

Furthermore, there are some prompts/seeds/subject material where these artifacts prove useful in making the output more variable and creative. Having the option to set these values manually and override our best judgement seems like one of the things that makes Diffusers great, and I wanted to retain that.
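
A minimal reproduction of the truncation mismatch, using the numbers from this comment:

num_inference_steps = 27
strength = 0.73

base_steps = int(num_inference_steps * strength)                  # int(19.71) -> 19
begin_inference_step = int(num_inference_steps * (1 - strength))  # int(7.29)  -> 7

# 19 + 7 = 26, not 27: one scheduled timestep is silently dropped,
# which misaligns the refiner's schedule by one step.
print(base_steps + begin_inference_step)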

@yiyixuxu (Collaborator) commented Jul 10, 2023

+1 on @patrickvonplaten's suggestion of using a percentage.

It seems to me we will just have to set denoising_end = denoising_start, and default it to (1 - strength)?

Then:
final_inference_steps = int(denoising_end * num_inference_steps)
begin_inference_steps = int(denoising_start * num_inference_steps)

So in the example with num_inference_steps = 27 and strength = 0.73, we will have final_inference_steps = begin_inference_steps = 7, i.e. run 7 steps in the base model and 20 in the refiner. Is this what's intended?

@bghira (Contributor Author) commented

I've tested this, and while it seems like it should work, it has really broken the results for me overall. I am digging into it.

@patrickvonplaten (Contributor) commented

Very excited about this PR. This makes a lot of sense. From a user perspective, I think it would be nicer to provide fractions/percentages, such as denoising_end=0.6 and denoising_start=0.4.

I think this has two advantages:

  • 1.) IMO it's easier to grasp the concept, e.g. "for 60% of the denoising process we use the base model" and "for the final 40% we use the refiner".
  • 2.) It should create a better workflow, as the user can change num_inference_steps just like before without having to change denoising_start or denoising_end.

So the API could look as follows:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent", num_inference_steps=20, denoising_end=0.5).images

# If you have low VRAM, free the base pipeline before loading the refiner.
# del pipe
# import gc
# gc.collect()
# torch.cuda.empty_cache()
# torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image, num_inference_steps=20, denoising_start=0.5).images

@patrickvonplaten patrickvonplaten requested a review from pcuenca July 10, 2023 16:06
@bghira (Contributor Author) commented Jul 10, 2023

Here is an updated example:

Example:
    from diffusers import DiffusionPipeline
    import torch

    pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
    pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

    # if using torch < 2.0
    # pipe.enable_xformers_memory_efficient_attention()

    prompt = "An astronaut riding a green horse"

    # We're going to schedule 20 steps, and complete 50% of them with each model.
    total_num_steps = 20
    refiner_strength = 0.5
    # These values can be manipulated manually to get different results.
    final_inference_step, begin_inference_step = pipe.timesteps_from_strength(refiner_strength, total_num_steps)
    image = pipe(prompt=prompt, output_type="latent",
                num_inference_steps=total_num_steps, final_inference_step=final_inference_step).images

    # If you have low VRAM, free the base pipeline before loading the refiner.
    # del pipe
    # import gc
    # gc.collect()
    # torch.cuda.empty_cache()
    # torch.clear_autocast_cache()

    # Put through the refiner now.
    pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
    pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

    # if using torch < 2.0
    # pipe.enable_xformers_memory_efficient_attention()

    images = pipe(prompt=prompt, image=image,
                num_inference_steps=total_num_steps, begin_inference_step=begin_inference_step).images
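
The body of timesteps_from_strength isn't shown above; a plausible sketch (an assumption for illustration, not the PR's actual implementation) would derive both boundaries from a single rounded value, so they can never drift apart by a step:

def timesteps_from_strength(strength: float, num_inference_steps: int):
    # Hypothetical helper: compute the refiner's share once, then derive
    # the base model's stopping point from it, so the two values always
    # partition num_inference_steps exactly.
    refiner_steps = int(round(strength * num_inference_steps))
    final_inference_step = num_inference_steps - refiner_steps  # base stops here
    begin_inference_step = final_inference_step                 # refiner resumes here
    return final_inference_step, begin_inference_step

print(timesteps_from_strength(0.5, 20))   # (10, 10)
print(timesteps_from_strength(0.73, 27))  # (7, 7): no one-step misalignment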

@bghira force-pushed the partial-diffusion branch from a7f47cd to 942c6a5 on July 10, 2023 21:54
@bghira force-pushed the partial-diffusion branch from 942c6a5 to 062afbb on July 11, 2023 01:28
@bghira (Contributor Author) commented Jul 11, 2023

I took a crack at that today, for a few hours, but I couldn't make the percentages work reliably; the risk of artifacting is pretty great. If anyone else would like to take a swing, I welcome more commits and attention on this work.

Here's a comparison grid of the demo prompt, with the fully denoised base model samples in the left column and the partially denoised base model samples in the right column.

The image at the top centre is what the base model did on its own.

The results use DDIM with 40 steps divided between the models, from 10% up to 90% refiner strength.

https://tripleback.net/public/partial-diffusion-comparison.png

@sayakpaul (Member) left a comment

Looks fantastic!

May I also have a brief explanation of the two separate test cases being added?

IIUC, their purpose is to:

  • Make sure this feature works for a set of schedulers
  • Make sure denoising_end and denoising_start behaviour is robust
  • Make sure the internal computations for steps within the schedulers are as expected

patrickvonplaten and others added 2 commits July 12, 2023 13:17
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
@patrickvonplaten (Contributor) commented

Failing test is unrelated - merging

@n00mkrad commented

@patrickvonplaten @bghira

The docs for this seem to be wrong: they say that denoising_end and denoising_start should be set to high_noise_frac, but this results in extremely smooth denoised images with no detail.

It should actually be n_steps * high_noise_frac for this value. However, when using this, the progress bar incorrectly shows about 10x too many steps, but the resulting image is perfect.

@n00mkrad commented

I was wrong. However, I don't see the point of this method.

Here is fraction 0.7 (top) vs 1.0 (bottom):

[comparison image]

Frac 1.0 also runs faster than 0.7.

I feel like I'm missing something here. Frac 1.0 should skip the refiner entirely, right? But then why is my result so, so much sharper than the 70/30 split of base/refiner?

@n00mkrad commented

After hours of debugging I found the issue:

THIS METHOD DOES NOT WORK WITH KARRAS SCHEDULERS!

Should either be fixed (if possible) or mentioned in the docs.

@AmericanPresidentJimmyCarter (Contributor) commented

After hours of debugging I found the issue:

THIS METHOD DOES NOT WORK WITH KARRAS SCHEDULERS!

Should either be fixed (if possible) or mentioned in the docs.

Yes, I already opened an issue and told them about it while they were implementing this, but it was still merged.

#4060

@bghira (Contributor Author) commented Jul 13, 2023

The Karras sigmas pattern is generally less tested and sees more issues like this. I stopped using them a month or more ago, in favour of the rescaled DDIM timesteps.

@n00mkrad commented

DPM++ 2M (without Karras) is also very washed out and less detailed than with the regular method.

@bghira (Contributor Author) commented Jul 13, 2023

Can you open a new issue report if you're seeing issues you'd like resolved? They're unlikely to receive the proper attention in a comment thread on a merged PR.

@n00mkrad commented

I guess we're done here as #4060 already mentions improvements to this approach.

@patrickvonplaten (Contributor) commented

Let's maybe continue the discussion in #4115 :-)

orpatashnik pushed a commit to orpatashnik/diffusers that referenced this pull request Aug 1, 2023
[SDXL] Partial diffusion support for Text2Img and Img2Img Pipelines (huggingface#4015)

* diffusers#4003 - initial implementation of max_inference_steps

* diffusers#4003 - initial implementation of max_inference_steps and first_inference_step for img2img

* diffusers#4003 - use first_inference_step as an input arg for get_timestamps in img2img

* diffusers#4003 Do not add noise during img2img when we have a defined first timestep

* diffusers#4003 Mild updates after revert

* diffusers#4003 Missing change

* Show implementation with denoising_start and end

* Apply suggestions from code review

* Update src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* move to 0.19.0dev

* Apply suggestions from code review

* add exhaustive tests

* add docs

* finish

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* make style

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
@bghira deleted the partial-diffusion branch September 29, 2023 16:35
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024