[SDXL] Partial diffusion support for Text2Img and Img2Img Pipelines #4015
@@ -59,21 +59,117 @@ image = pipe(prompt=prompt).images[0]
### Refining the image output

In addition to the [base model checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9),
StableDiffusion-XL also includes a [refiner checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9)
that is specialized in denoising low-noise-stage images to generate images of improved high-frequency quality.
This refiner checkpoint can be used as a "second-step" pipeline after running the base checkpoint to improve
image quality; in this case, you only have to output the `latents` from the base model.
When using the refiner, one can easily
- 1.) employ the base model and refiner as an *Ensemble of Expert Denoisers*, as first proposed in [eDiff-I](https://research.nvidia.com/labs/dir/eDiff-I/), or
- 2.) simply run the refiner in [SDEdit](https://arxiv.org/abs/2108.01073) fashion after the base model.
**Note**: The idea of using the SD-XL base & refiner models as an ensemble of experts was first brought forward by
a couple of community contributors, who also helped shape the following `diffusers` implementation, namely:
- [SytanSD](https://github.com/SytanSD)
- [bghira](https://github.com/bghira)
- [Birch-san](https://github.com/Birch-san)
#### 1.) Ensemble of Expert Denoisers

When using the base and refiner models as an ensemble of expert denoisers, the base model serves as the
expert for the high-noise diffusion stage and the refiner serves as the expert for the low-noise diffusion stage.
The advantage of 1.) over 2.) is that it requires fewer denoising steps overall and should therefore be significantly
faster. The drawback is that one cannot really inspect the output of the base model: it is only partially denoised and still contains a substantial amount of noise.
To use the base model and refiner as an ensemble of expert denoisers, make sure to define the fraction
of timesteps which should be run through the high-noise denoising stage (*i.e.* the base model) and the low-noise
denoising stage (*i.e.* the refiner model), respectively. This fraction should be set as the [`~StableDiffusionXLPipeline.__call__.denoising_end`] of the base model
and as the [`~StableDiffusionXLImg2ImgPipeline.__call__.denoising_start`] of the refiner model.
Let's look at an example.
First, we load the two pipelines. Since the second text encoder and the variational autoencoder are the same,
you don't have to load those again for the refiner.
```py
from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
base.to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")
```
Now we define the number of inference steps and the fraction of those steps that should be run through the
high-noise denoising stage (*i.e.* the base model).
```py
n_steps = 40
high_noise_frac = 0.7
```
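As a quick sanity check, the step split implied by this fraction can be computed directly. The snippet below is a sketch rather than part of the documentation; it simply mirrors the `int(round(...))` rule the pipeline applies to `denoising_end`:

```py
# Sanity-check sketch (not from the docs): split n_steps between the two experts,
# using the same int(round(...)) rounding as the pipeline's denoising_end handling.
base_steps = int(round(high_noise_frac * n_steps))  # 28 steps run in the base model
refiner_steps = n_steps - base_steps                # 12 steps run in the refiner
print(base_steps, refiner_steps)  # 28 12
```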
A fraction of 0.7 means that 70% of the 40 inference steps (28 steps) are run through the base model
and the remaining 12 steps are run through the refiner. Let's run the two pipelines now.
Make sure to set `denoising_end` and `denoising_start` to the same value and keep `num_inference_steps`
constant. Also remember that the output of the base model should be in latent space:
```py
prompt = "A majestic lion jumping from a big stone at night"

image = base(prompt=prompt, num_inference_steps=n_steps, denoising_end=high_noise_frac, output_type="latent").images
image = refiner(prompt=prompt, num_inference_steps=n_steps, denoising_start=high_noise_frac, image=image).images[0]
```
Let's have a look at the image:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png)
If we had just run the base model for the same 40 steps, the image would arguably have been less detailed (e.g. the lion's eyes and nose):

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion.png)
<Tip>

The ensemble-of-experts method works well on all available schedulers!

</Tip>
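For example, a different scheduler can be swapped onto both pipelines before running the ensemble. This is a sketch rather than part of the original docs, reusing the `base` and `refiner` pipelines loaded above; `EulerDiscreteScheduler` is just one of the available schedulers:

```py
# Sketch (not from the PR docs): swap the scheduler on both pipelines before
# running the ensemble-of-experts example above.
from diffusers import EulerDiscreteScheduler

base.scheduler = EulerDiscreteScheduler.from_config(base.scheduler.config)
refiner.scheduler = EulerDiscreteScheduler.from_config(refiner.scheduler.config)
```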
#### 2.) Refining the image output from a fully denoised base image

In standard [`StableDiffusionImg2ImgPipeline`] fashion, the fully-denoised image generated by the base model
can be further improved using the [refiner checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9).
> suggestion: list advantages and disadvantages of this method

> I'd be in favor of including these! If we can, let's maybe also highlight the point on reducing inference latency?

> Want to wait here a bit to see what SAI will use as the "official" way.
For this, you simply run the refiner as a normal image-to-image pipeline after the "base" text-to-image
pipeline. You can leave the outputs of the base model in latent space.
```py
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    text_encoder_2=pipe.text_encoder_2,
    vae=pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")
```
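As a rough sketch of how these two pipelines would then be used together (assumed usage, not verbatim from this diff), the base model's latent output is simply handed to the refiner as a standard image-to-image call:

```py
# Sketch (assumption, not verbatim from the PR): run the base text-to-image pipeline,
# keep its output in latent space, then refine it with a regular image-to-image pass.
prompt = "A majestic lion jumping from a big stone at night"

latents = pipe(prompt=prompt, output_type="latent").images
image = refiner(prompt=prompt, image=latents).images[0]
```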
@@ -545,6 +545,7 @@ def __call__(
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_inference_steps: int = 50,
    denoising_end: Optional[float] = None,
    guidance_scale: float = 5.0,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    num_images_per_prompt: Optional[int] = 1,
@@ -579,6 +580,14 @@ def __call__(

    num_inference_steps (`int`, *optional*, defaults to 50):
        The number of denoising steps. More denoising steps usually lead to a higher quality image at the
        expense of slower inference.
    denoising_end (`float`, *optional*):
        When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be
        completed before it is intentionally prematurely terminated. For instance, if denoising_end is set to
        0.7 and `num_inference_steps` is fixed at 50, the process will execute only 35 (i.e., 0.7 * 50)
        denoising steps. As a result, the returned sample will still retain a substantial amount of noise. The
        denoising_end parameter should ideally be utilized when this pipeline forms a part of a "Mixture of
        Denoisers" multi-pipeline setup, as elaborated in [**Refining the Image
        Output**](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#refining-the-image-output)
    guidance_scale (`float`, *optional*, defaults to 7.5):
        Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
        `guidance_scale` is defined as `w` of equation 2. of [Imagen
@@ -746,7 +755,13 @@ def __call__(

    add_time_ids = add_time_ids.to(device).repeat(batch_size * num_images_per_prompt, 1)

    # 8. Denoising loop
    num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)

> i noticed that going negative, as well.

    # 7.1 Apply denoising_end
    if denoising_end is not None:
        num_inference_steps = int(round(denoising_end * num_inference_steps))
        timesteps = timesteps[: num_warmup_steps + self.scheduler.order * num_inference_steps]

> this resolved the progress bar not completing. however, i really liked that emergent property, as it meant any progress bar capturing the tqdm output would "just magically" have the progress bar continue on logically from one to the next. 😥 more work for that result, but i've been needing to refactor how i handle that, anyway...

    with self.progress_bar(total=num_inference_steps) as progress_bar:
        for i, t in enumerate(timesteps):
            # expand the latents if we are doing classifier free guidance
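Regarding the progress-bar point above: one way to keep a single bar running across both experts is to drive an external `tqdm` bar from the per-step `callback` argument the pipelines expose. The snippet below is a sketch under that assumption and is not part of this PR:

```py
# Sketch (assumption, not part of this PR): use the pipelines' per-step callback to
# update one external tqdm bar, so it runs 0 -> 40 across base + refiner even though
# each pipeline's own bar now stops early / starts late.
import torch
from diffusers import DiffusionPipeline
from tqdm import tqdm

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")

n_steps, high_noise_frac = 40, 0.7
prompt = "A majestic lion jumping from a big stone at night"

with tqdm(total=n_steps) as overall:
    def tick(step, timestep, latents):
        overall.update(1)  # one tick per denoising step, in either pipeline

    latents = base(
        prompt=prompt, num_inference_steps=n_steps, denoising_end=high_noise_frac,
        output_type="latent", callback=tick,
    ).images
    image = refiner(
        prompt=prompt, num_inference_steps=n_steps, denoising_start=high_noise_frac,
        image=latents, callback=tick,
    ).images[0]
```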