Faster set_adapters #10777
Conversation
Thanks for this PR. Do you have any benchmarking numbers on this?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I made a small benchmark with Colab here.

import time
from tqdm import tqdm
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

def load_n_lora(n):
    pipe.unload_lora_weights()
    adapter_names = []
    for i in range(n):
        adapter_names.append(f"floor{i}")
        pipe.load_lora_weights("maria26/Floor_Plan_LoRA", adapter_name=f"floor{i}")  # also very slow
    return adapter_names

for n_lora in [1, 5, 10, 20, 50]:
    adapter_names = load_n_lora(n_lora)
    adapter_weights = [1. / n_lora] * n_lora
    start = time.time()
    pipe.set_adapters(adapter_names, adapter_weights=adapter_weights)
    end = time.time()
    print(f"n_lora: {n_lora}, time: {end - start}")
It's consistently 1.5 to 2x faster on SD1.5, and the difference will be more significant with larger models (e.g., Flux), btw.
Thanks for providing the benchmark!
Can you ensure you're using
I reran the benchmark, set
Can you confirm if the machine you're using to benchmark is shared by other users? Sometimes that can perturb the results. It's a bit weird that you're experiencing such low load times (even though this issue should be in a different thread). We benchmarked it with
Don't you think it's natural to see longer load times with an increasing number of LoRAs being loaded, as we're also doing
Oh I see, I forgot the unload is still inside the timing. I'm rerunning the benchmark now.
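For reference, a minimal sketch of how the load timing could be isolated so that unload_lora_weights stays outside the measured region. This reuses the pipe and LoRA repo from the benchmark above and is only an illustration of the idea, not the actual rerun benchmark:

# Sketch: time only load_lora_weights; unload_lora_weights runs before the timer starts.
for n_lora in [1, 5, 10, 20, 50]:
    pipe.unload_lora_weights()  # not included in the measured time
    start = time.time()
    for i in range(n_lora):
        pipe.load_lora_weights("maria26/Floor_Plan_LoRA", adapter_name=f"floor{i}")
    end = time.time()
    print(f"n_lora: {n_lora}, load time: {end - start:.2f}s")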
What is the expected time you would like to see here? 👀 4 mins for 152 LoRAs seems reasonable to me.
In the first 7 seconds, diffusers can load 20 LoRAs, so I expect it could load faster overall. Nvm, I'll look into it more closely tomorrow since it's pretty late now.
The changes here LGTM, even if there was no speedup. Thanks.
Regarding the LoRA loading, I'd suggest opening another issue and investigating the underlying problem there.
The failing test is unrelated. Thanks for your contributions!
What does this PR do?
The previous code iterated through model.named_modules() for each adapter, which can be very costly when the number of adapters reaches hundreds. I've slightly changed the logic to iterate over model.named_modules() only once, setting the adapters for each submodule within that pass.
I haven't run extensive qualitative tests yet, but in my local experiment (Flux with 150+ adapters 😅) this change is significantly faster.
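To make the change concrete, here is a rough sketch of the loop restructuring. It is only an illustration of the idea, not the actual diffusers/PEFT code; is_lora_layer and set_scale are hypothetical stand-ins for the real layer check and scaling call:

def set_adapters_per_adapter(model, adapter_names, weights, is_lora_layer):
    # Old pattern: one full traversal of model.named_modules() per adapter,
    # so the cost grows with num_adapters * num_modules.
    for name, weight in zip(adapter_names, weights):
        for _, module in model.named_modules():
            if is_lora_layer(module):
                module.set_scale(name, weight)  # hypothetical scaling call

def set_adapters_single_pass(model, adapter_names, weights, is_lora_layer):
    # New pattern: a single traversal; every adapter is configured for a module
    # while it is being visited, so named_modules() is walked only once.
    for _, module in model.named_modules():
        if is_lora_layer(module):
            for name, weight in zip(adapter_names, weights):
                module.set_scale(name, weight)  # hypothetical scaling call

The work done per LoRA layer is the same in both versions; only the number of module-tree traversals changes.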
Before submitting
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.