Faster set_adapters #10777


Merged

4 commits merged into huggingface:main from the faster-set-adapters branch on Feb 12, 2025

Conversation

@Luvata (Contributor) commented on Feb 12, 2025

What does this PR do?

The previous code iterated through model.named_modules() for each adapter, which can be very costly when the number of adapters reaches hundreds.

I've slightly changed the logic to iterate over model.named_modules() only once, setting the adapters for each submodule within that pass.

I haven't run extensive qualitative tests yet, but in my local experiment (Flux with 150+ adapters 😅) this change is significantly faster.
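A simplified sketch of the idea (illustrative only, not the exact diffusers code; it assumes PEFT's BaseTunerLayer marker class and the set_scale/set_adapter methods that LoRA-injected layers expose):

from peft.tuners.tuners_utils import BaseTunerLayer

# Before (roughly): one full pass over the model per adapter -> O(n_adapters * n_modules)
def set_adapters_slow(model, adapter_names, weights):
    for name, weight in zip(adapter_names, weights):
        for _, module in model.named_modules():
            if isinstance(module, BaseTunerLayer):
                module.set_scale(name, weight)
    # ... plus one more pass to activate all adapters
    for _, module in model.named_modules():
        if isinstance(module, BaseTunerLayer):
            module.set_adapter(adapter_names)

# After: a single pass, configuring every adapter on each LoRA layer -> O(n_modules)
def set_adapters_fast(model, adapter_names, weights):
    for _, module in model.named_modules():
        if isinstance(module, BaseTunerLayer):
            for name, weight in zip(adapter_names, weights):
                module.set_scale(name, weight)
            module.set_adapter(adapter_names)  # activate all adapters at once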

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul (Member)

Thanks for this PR. Do you have any benchmarking numbers on this?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Luvata (Contributor, Author) commented on Feb 12, 2025

I made a small benchmark with Colab here.

import time
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")


def load_n_lora(n):
    # Start from a clean state, then load the same LoRA n times under different adapter names
    pipe.unload_lora_weights()
    adapter_names = []
    for i in range(n):
        adapter_names.append(f"floor{i}")
        pipe.load_lora_weights("maria26/Floor_Plan_LoRA", adapter_name=f"floor{i}")  # also very slow
    return adapter_names


for n_lora in [1, 5, 10, 20, 50]:
    adapter_names = load_n_lora(n_lora)
    adapter_weights = [1.0 / n_lora] * n_lora

    # Time only the set_adapters call
    start = time.time()
    pipe.set_adapters(adapter_names, adapter_weights=adapter_weights)
    end = time.time()
    print(f"n_lora: {n_lora}, time: {end - start}")
(times in seconds)

n_lora    main branch set_adapters    faster set_adapters    load_lora_weights
1         0.0164                      0.012                  0.466
5         0.118                       0.067                  2.763
10        0.699                       0.206                  6.082
20        1.762                       1.283                  14.064
50        9.661                       5.975                  49.573

It's consistently 1.5 to 2x faster on SD1.5, and the difference will be more significant with larger models (e.g. Flux).

Btw, load_lora_weights is also very slow as the number of LoRAs increases, but since in my use case it runs only once at the start to load all LoRAs, I didn't look into optimizing it.

@sayakpaul (Member)

Thanks for providing the benchmark!

> Btw, load_lora_weights is also very slow as the number of LoRAs increases, but since in my use case it runs only once at the start to load all LoRAs, I didn't look into optimizing it.

Can you ensure you're using low_cpu_mem_usage=True? By default it should be set to True based on your env, but just double-checking.
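For reference, passing it explicitly would look like this (a minimal sketch, assuming a diffusers/PEFT version where load_lora_weights accepts this kwarg):

pipe.load_lora_weights(
    "maria26/Floor_Plan_LoRA",
    adapter_name="floor0",
    low_cpu_mem_usage=True,  # avoids fully initializing the LoRA layers before loading the state dict
)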

@Luvata (Contributor, Author) commented on Feb 12, 2025

I reran the benchmark with low_cpu_mem_usage=True, averaged over 3 runs.

load_lora_weights is still very slow. I will look into it tomorrow.

(times in seconds)

n_lora    main: load_lora_weights    main: set_adapters    this branch: load_lora_weights    this branch: set_adapters
1         0.482 ± 0.027              0.017 ± 0.001         0.521 ± 0.064                     0.013 ± 0.000
5         2.532 ± 0.245              0.266 ± 0.218         2.576 ± 0.136                     0.074 ± 0.001
10        5.734 ± 0.157              0.470 ± 0.173         5.271 ± 0.359                     0.281 ± 0.069
20        13.121 ± 0.204             1.926 ± 0.190         13.936 ± 1.141                    1.283 ± 0.061
50        50.295 ± 0.381             10.506 ± 0.027        49.062 ± 1.400                    6.517 ± 0.327

@sayakpaul (Member) commented on Feb 12, 2025

Can you confirm whether the machine you're using to benchmark is shared with other users? Sometimes that can perturb the results.

It's a bit weird that you're experiencing such slow load times (even though this issue should be in a different thread). We benchmarked it with low_cpu_mem_usage and you can check the results here: #9510.

Don't you think it's natural to see longer load times with an increasing number of LoRAs being loaded, since we're also doing unload_lora_weights()?

@Luvata (Contributor, Author) commented on Feb 12, 2025

Oh I see, I forgot that the unload is still inside the timing; I'm rerunning the benchmark now.
Btw, for reference, in my local experiment with Flux Schnell, loading LoRA weights with the default settings for 152 adapters (rank 4) takes around 4 minutes with no unload.
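For what it's worth, a corrected timing loop based on the snippet above might look like this (a sketch; unload_lora_weights is excluded from the timed regions, and loading and set_adapters are timed separately):

import time

for n_lora in [1, 5, 10, 20, 50]:
    pipe.unload_lora_weights()  # reset state, not timed
    adapter_names = [f"floor{i}" for i in range(n_lora)]

    start = time.time()
    for name in adapter_names:
        pipe.load_lora_weights("maria26/Floor_Plan_LoRA", adapter_name=name)
    load_time = time.time() - start

    start = time.time()
    pipe.set_adapters(adapter_names, adapter_weights=[1.0 / n_lora] * n_lora)
    set_time = time.time() - start
    print(f"n_lora: {n_lora}, load: {load_time:.3f}s, set_adapters: {set_time:.3f}s")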

@Luvata (Contributor, Author) commented on Feb 12, 2025

Btw, this is my screenshot of loading 152 LoRAs with Flux Schnell, rank 4. It looks fast at first but then gets slower and slower.

It's running on my lab cluster, only me, not shared with anyone, and low_cpu_mem_usage is set to True.

[screenshots: loading progress for 152 LoRAs]

I've rerun it twice, and it still takes 5 minutes.
I should open a new issue for load_lora_weights.

@sayakpaul (Member)

> Btw, for reference, in my local experiment with Flux Schnell, loading LoRA weights with the default settings for 152 adapters (rank 4) takes around 4 minutes with no unload.

What is the expected time you would like to see here? 👀 4 mins for 152 LoRAs seems reasonable to me.

@Luvata (Contributor, Author) commented on Feb 12, 2025

In the first 7 seconds, diffusers can load 20 LoRAs, so I'd expect it could load faster overall. Never mind, I'll look into it more closely tomorrow since it's pretty late now.

@BenjaminBossan (Member) left a review

The changes here LGTM, even if there were no speedup. Thanks.

Regarding the LoRA loading, I'd suggest opening another issue and investigating the underlying problem there.

@sayakpaul (Member)

Failing test is unrelated. Thanks for your contributions!

@sayakpaul merged commit 067eab1 into huggingface:main on Feb 12, 2025
11 of 12 checks passed
@Luvata deleted the faster-set-adapters branch on February 13, 2025 at 00:34