[SDXL] Partial diffusion support for Text2Img and Img2Img Pipelines #4015

Merged: 20 commits, Jul 12, 2023

Conversation

@bghira (Contributor) commented Jul 10, 2023

What does this PR do?

Fixes #4003

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented Jul 10, 2023

The documentation is not available anymore as the PR was closed or merged.

@sayakpaul (Member) left a comment

This looks pretty good to me. For me, the following things are TODO:

More than happy to help if things are unclear.

@patrickvonplaten WDYT?

@patrickvonplaten (Contributor) commented Jul 10, 2023

Hey @bghira,

Thanks for the PR! Could you maybe post a couple of examples with diffusers code that show how this approach is better? Or I guess the workflow is as stated here: #4003 (comment)?

Copy-pasting it here:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent",
              num_inference_steps=20, max_inference_steps=10).images

# If you have low VRAM, free the base pipeline before loading the refiner.
# del pipe
# import gc
# gc.collect()
# torch.cuda.empty_cache()
# torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image,
               num_inference_steps=20, first_inference_step=10, add_noise=False).images

@patrickvonplaten (Contributor) commented Jul 10, 2023

Hey @bghira,

Very cool PR, I think I've understood the full workflow now. So essentially we use the SD-XL base model to denoise the image at high-noise timesteps, and we use the refiner to denoise it at low-noise timesteps.

In that sense your approach here is very similar to the idea of eDiffi - it's very cool that you made it work for SD-XL!

From the eDiffi paper:

Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored, and the task changes to producing outputs of high visual fidelity.

=> So in a sense "base" is used here mostly to generate text-aligned content and the refiner is used for high-visual-fidelity results.

Think we need to add tests and docs here as well :-)
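
To make the idea concrete, here is a minimal sketch of such a split (illustrative only, with assumed values; this is not the PR's actual code). With a 20-step schedule and a 50/50 split, the base model handles the first, high-noise half of the timesteps and the refiner the second, low-noise half:

import torch

num_inference_steps = 20
num_train_timesteps = 1000  # length of the training noise schedule

# Descending timesteps, as diffusers schedulers produce them.
step = num_train_timesteps // num_inference_steps
timesteps = torch.arange(num_train_timesteps - 1, -1, -step)

split = num_inference_steps // 2
base_timesteps = timesteps[:split]     # high noise: text-aligned content
refiner_timesteps = timesteps[split:]  # low noise: visual fidelity
print(base_timesteps.tolist())     # [999, 949, ..., 549]
print(refiner_timesteps.tolist())  # [499, 449, ..., 49]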

@@ -619,6 +638,8 @@ def __call__(
] = None,
strength: float = 0.3,
num_inference_steps: int = 50,
max_inference_steps: Optional[int] = None,
@patrickvonplaten (Contributor) commented Jul 10, 2023

Suggested change
max_inference_steps: Optional[int] = None,
denoising_end: Optional[float] = None,

Same as: https://github.com/huggingface/diffusers/pull/4015/files#r1257982804 => can we switch to % here maybe?

@bghira (Contributor Author) commented

If you use percentages, the precision lost when the calculation is converted to an integer means you might misalign timesteps, which results in artifacting.

A member commented

Fair point!

@bghira (Contributor Author) commented

I have an idea: I will implement a helper method that returns the pair of step values from two inputs.

A contributor commented

Hmm, can you explain how you might accidentally misalign?

A contributor commented

Actually I think this can't really happen if we always round, e.g.:

max_inference_steps = int(round(denoising_end * num_inference_steps))
first_inference_steps = int(round(denoising_start * num_inference_steps))

denoising_end should always be equal to 1 - denoising_start, and in this case we will always have:

num_inference_steps = max_inference_steps + first_inference_steps

no?
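
For concreteness, a quick numeric check of this rounding scheme on the 27-step example discussed below (illustrative values, not code from the PR):

num_inference_steps = 27
denoising_end = 0.73              # base model's share of the steps
denoising_start = 1 - denoising_end

max_inference_steps = int(round(denoising_end * num_inference_steps))      # round(19.71) -> 20
first_inference_steps = int(round(denoising_start * num_inference_steps))  # round(7.29)  -> 7

# The two halves partition the schedule exactly in this case.
assert max_inference_steps + first_inference_steps == num_inference_steps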

@bghira (Contributor Author) commented

It happens when you have values for strength that leave fractional remainders, which the int() cast discards. I have tested it.

In this test, num_inference_steps is set to 27 and strength is set to 0.73. The calculation says the base model should run for about 19.71 steps, but since we can't run a fractional step, int() truncates this to 19, while the value for begin_inference_step is 7.29, which truncates to 7.

This results in a misalignment of one step, which is, apparently, enough to disrupt the output altogether.

We could use round(), but instead I have opted for what I consider a more robust approach.

Furthermore, there are some prompts/seeds/subject material where these artifacts prove useful in making the output more variable and creative. Having the option to set these values manually and override our best judgement seems like one of the things that makes Diffusers great, and I wanted to retain that.
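
A minimal reproduction of the truncation mismatch, using the numbers from this comment:

num_inference_steps = 27
strength = 0.73

base_steps = int(num_inference_steps * strength)                  # int(19.71) -> 19
begin_inference_step = int(num_inference_steps * (1 - strength))  # int(7.29)  -> 7

# 19 + 7 = 26, not 27: one scheduled timestep is silently dropped,
# which misaligns the refiner's schedule by one step.
print(base_steps + begin_inference_step)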

@yiyixuxu (Collaborator) commented Jul 10, 2023

+1 on @patrickvonplaten's suggestion of using a percentage.

It seems to me we will just have to set denoising_end = denoising_start, and default it to (1 - strength)?

Then:
final_inference_steps = int(denoising_end * num_inference_steps)
begin_inference_steps = int(denoising_start * num_inference_steps)

So in the example with num_inference_steps = 27 and strength = 0.73, we will have final_inference_steps = begin_inference_steps = 7, i.e. run 7 steps in the base model and 20 in the refiner. Is this what's intended?

@bghira (Contributor Author) commented

I've tested this, and while it seems like it should work, it has really broken the results for me overall. I am digging into it.

@patrickvonplaten (Contributor) commented

Very excited about this PR. This makes a lot of sense. From a user perspective, I think it would be nicer to provide fractions/percentages, such as denoising_end=0.6 and denoising_start=0.4.

I think this has two advantages:

  • 1.) IMO it's easier to grasp the concept, e.g. "for 60% of the denoising process we use the base model" and "for the final 40% we use the refiner".
  • 2.) It should create a better workflow, as the user can change num_inference_steps just like before without having to change denoising_start or denoising_end.

So the API could look as follows:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent", num_inference_steps=20, denoising_end=0.5).images

# If you have low VRAM, free the base pipeline before loading the refiner.
# del pipe
# import gc
# gc.collect()
# torch.cuda.empty_cache()
# torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image, num_inference_steps=20, denoising_start=0.5).images

@patrickvonplaten patrickvonplaten requested a review from pcuenca July 10, 2023 16:06
@bghira (Contributor Author) commented Jul 10, 2023

Here is an updated example:

Example:
    from diffusers import DiffusionPipeline
    import torch

    pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
    pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

    # if using torch < 2.0
    # pipe.enable_xformers_memory_efficient_attention()

    prompt = "An astronaut riding a green horse"

    # We're going to schedule 20 steps, and complete 50% of them with each model.
    total_num_steps = 20
    refiner_strength = 0.5
    # These values can be manipulated manually to get different results.
    final_inference_step, begin_inference_step = pipe.timesteps_from_strength(refiner_strength, total_num_steps)
    image = pipe(prompt=prompt, output_type="latent",
                num_inference_steps=total_num_steps, final_inference_step=final_inference_step).images

    # If you have low VRAM, free the base pipeline before loading the refiner.
    # del pipe
    # import gc
    # gc.collect()
    # torch.cuda.empty_cache()
    # torch.clear_autocast_cache()

    # Put through the refiner now.
    pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
    pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

    # if using torch < 2.0
    # pipe.enable_xformers_memory_efficient_attention()

    images = pipe(prompt=prompt, image=image,
                num_inference_steps=total_num_steps, begin_inference_step=begin_inference_step).images
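
The body of timesteps_from_strength isn't shown above; a plausible sketch (an assumption for illustration, not the PR's actual implementation) would derive both boundaries from a single rounded value, so they can never drift apart by a step:

def timesteps_from_strength(strength: float, num_inference_steps: int):
    # Hypothetical helper: compute the refiner's share once, then derive
    # the base model's stopping point from it, so the two values always
    # partition num_inference_steps exactly.
    refiner_steps = int(round(strength * num_inference_steps))
    final_inference_step = num_inference_steps - refiner_steps  # base stops here
    begin_inference_step = final_inference_step                 # refiner resumes here
    return final_inference_step, begin_inference_step

print(timesteps_from_strength(0.5, 20))   # (10, 10)
print(timesteps_from_strength(0.73, 27))  # (7, 7): no one-step misalignment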

@bghira force-pushed the partial-diffusion branch from a7f47cd to 942c6a5 on July 10, 2023 21:54
@bghira force-pushed the partial-diffusion branch from 942c6a5 to 062afbb on July 11, 2023 01:28
@bghira (Contributor Author) commented Jul 11, 2023

I took a crack at that today, for a few hours, but I couldn't make the percentages work reliably; the risk of artifacting is pretty great. If anyone else would like to take a swing, I welcome more commits and attention on this work.

Here's a comparison grid of the demo prompt, with the fully denoised base model samples in the left column and the partially denoised base model samples in the right column.

The image at the top centre is what the base model did on its own.

The results use DDIM with 40 steps divided between the models, from 10% up to 90% refiner strength.

https://tripleback.net/public/partial-diffusion-comparison.png

@sayakpaul (Member) left a comment

Looks fantastic!

May I also have a brief explanation of the two separate test cases being added?

IIUC, their purpose is to:

  • Make sure this feature works for a set of schedulers
  • Make sure denoising_end and denoising_start behaviour is robust
  • Make sure the internal computations for steps within the schedulers are as expected

patrickvonplaten and others added 2 commits July 12, 2023 13:17
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
@patrickvonplaten (Contributor) commented

Failing test is unrelated - merging

@n00mkrad commented

@patrickvonplaten @bghira

The docs for this seem to be wrong: they say that denoising_end and denoising_start should be set to high_noise_frac, but this results in extremely smooth denoised images with no detail.

It should actually be n_steps * high_noise_frac for this value. However, when using this, the progress bar incorrectly shows about 10x too many steps, but the resulting image is perfect.

@n00mkrad commented

I was wrong. However, I don't see the point of this method.

Here is fraction 0.7 (top) vs 1.0 (bottom):

[comparison image]

Frac 1.0 also runs faster than 0.7.

I feel like I'm missing something here. Frac 1.0 should skip the refiner entirely, right? But then why is my result so, so much sharper than the 70/30 split of base/refiner?

@n00mkrad commented

After hours of debugging I found the issue:

THIS METHOD DOES NOT WORK WITH KARRAS SCHEDULERS!

Should either be fixed (if possible) or mentioned in the docs.

@AmericanPresidentJimmyCarter (Contributor) commented

After hours of debugging I found the issue:

THIS METHOD DOES NOT WORK WITH KARRAS SCHEDULERS!

Should either be fixed (if possible) or mentioned in the docs.

Yes, I already opened an issue and told them about it while they were implementing this, but it was still merged.

#4060

@bghira (Contributor Author) commented Jul 13, 2023

The Karras sigmas pattern is generally less tested and sees more issues like this. I stopped using them a month or more ago, in favour of the rescaled DDIM timesteps.

@n00mkrad commented

DPM++ 2M (without Karras) is also very washed out and less detailed than with the regular method.

@bghira (Contributor Author) commented Jul 13, 2023

Can you open a new issue report if you're seeing issues you'd like resolved? They're unlikely to receive the proper attention in a comment thread on a merged PR.

@n00mkrad commented

I guess we're done here as #4060 already mentions improvements to this approach.

@patrickvonplaten (Contributor) commented

Let's maybe continue the discussion in #4115 :-)

orpatashnik pushed a commit to orpatashnik/diffusers that referenced this pull request Aug 1, 2023
[SDXL] Partial diffusion support for Text2Img and Img2Img Pipelines (huggingface#4015)

* diffusers#4003 - initial implementation of max_inference_steps

* diffusers#4003 - initial implementation of max_inference_steps and first_inference_step for img2img

* diffusers#4003 - use first_inference_step as an input arg for get_timestamps in img2img

* diffusers#4003 Do not add noise during img2img when we have a defined first timestep

* diffusers#4003 Mild updates after revert

* diffusers#4003 Missing change

* Show implementation with denoising_start and end

* Apply suggestions from code review

* Update src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* move to 0.19.0dev

* Apply suggestions from code review

* add exhaustive tests

* add docs

* finish

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* make style

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
@bghira deleted the partial-diffusion branch September 29, 2023 16:35
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024