
Commit 599c887

sayakpaul and SunMarc authored
feat: pipeline-level quantization config (#11130)
* feat: pipeline-level quant config. Co-authored-by: SunMarc <marc.sun@hotmail.fr> Condition better, support mapping, general improvements.

The squashed history also carries the commits merged in from `main` while this branch was open:

* [Quantization] Add Quanto backend (#10756)
* [Single File] Add single file loading for SANA Transformer (#10947)
* [LoRA] Improve warning messages when LoRA loading becomes a no-op (#10187)
* [LoRA] CogView4 (#10981)
* [Tests] improve quantization tests by additionally measuring the inference memory savings (#11021)
* [Research Project] Add AnyText: Multilingual Visual Text Generation And Editing (#8998)
* [Quantization] Allow loading TorchAO serialized Tensor objects with torch>=2.6 (#11018)
* fix: mixture tiling sdxl pipeline - adjust generating time_ids & embeddings (#11012)
* [LoRA] support wan i2v loras from the world (#11025)
* Fix SD3 IPAdapter feature extractor (#11027)
* chore: fix help messages in advanced diffusion examples (#10923)
* Fix missing **kwargs in lora_pipeline.py (#11011)
* Fix for multi-GPU WAN inference: ensure hidden_state and shift/scale are on the same device (#10997)
* [Refactor] Clean up import utils boilerplate (#11026)
* Use `output_size` in `repeat_interleave` (#11030)
* [hybrid inference 🍯🐝] Add VAE encode (#11017)
* Wan Pipeline scaling fix, type hint warning, multi generator fix (#11007)
* [LoRA] change to warning from info when notifying the users about a LoRA no-op (#11044)
* Rename Lumina(2)Text2ImgPipeline -> Lumina(2)Pipeline (#10827)
* making `formatted_images` initialization compact (#10801)
* Fix aclnnRepeatInterleaveIntWithDim error on NPU for get_1d_rotary_pos_embed (#10820)
* [Tests] restrict memory tests for quanto for certain schemes (#11052)
* [LoRA] feat: support non-diffusers wan t2v loras (#11059)
* [examples/controlnet/train_controlnet_sd3.py] Fixes #11050 - Cast prompt_embeds and pooled_prompt_embeds to weight_dtype to prevent dtype mismatch (#11051)
* reverts accidental change that removes attn_mask in attn; improves flux ptxla by using flash block sizes and moves encoding outside the for loop (#11065)
* Fix deterministic issue when getting pipeline dtype and device (#10696)
* [Tests] add requires peft decorator (#11037)
* CogView4 Control Block (#10809)
* [CI] pin transformers version for benchmarking (#11067)
* Fix Wan I2V Quality (#11087)
* LTX 0.9.5 (#10968)
* make PR GPU tests conditioned on styling (#11099)
* Group offloading improvements (#11094)
* Fix pipeline_flux_controlnet.py (#11095)
* update readme instructions (#11096)
* Resolve stride mismatch in UNet's ResNet to support Torch DDP (#11098)
* Fix Group offloading behaviour when using streams (#11097)
* Quality options in `export_to_video` (#11090)

Work specific to this PR: revert the initial "feat: pipeline-level quant config." commit (316ff46) and re-implement the pipeline-level quantization config (Co-authored-by: SunMarc <marc@huggingface.co>); add validation of the input `quantization_config`; add docstring placeholders, tests, docs, and the nightly CI job; clarify recommendations in the docs.

---------

Co-authored-by: SunMarc <marc@huggingface.co>
1 parent 393aefc commit 599c887

File tree

9 files changed: +541, -4 lines changed


.github/workflows/nightly_tests.yml

Lines changed: 54 additions & 0 deletions

```diff
@@ -525,6 +525,60 @@ jobs:
           pip install slack_sdk tabulate
           python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
 
+  run_nightly_pipeline_level_quantization_tests:
+    name: Torch quantization nightly tests
+    strategy:
+      fail-fast: false
+      max-parallel: 2
+    runs-on:
+      group: aws-g6e-xlarge-plus
+    container:
+      image: diffusers/diffusers-pytorch-cuda
+      options: --shm-size "20gb" --ipc host --gpus 0
+    steps:
+      - name: Checkout diffusers
+        uses: actions/checkout@v3
+        with:
+          fetch-depth: 2
+      - name: NVIDIA-SMI
+        run: nvidia-smi
+      - name: Install dependencies
+        run: |
+          python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
+          python -m uv pip install -e [quality,test]
+          python -m uv pip install -U bitsandbytes optimum_quanto
+          python -m uv pip install pytest-reportlog
+      - name: Environment
+        run: |
+          python utils/print_env.py
+      - name: Pipeline-level quantization tests on GPU
+        env:
+          HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+          # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
+          CUBLAS_WORKSPACE_CONFIG: :16:8
+          BIG_GPU_MEMORY: 40
+        run: |
+          python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
+            --make-reports=tests_pipeline_level_quant_torch_cuda \
+            --report-log=tests_pipeline_level_quant_torch_cuda.log \
+            tests/quantization/test_pipeline_level_quantization.py
+      - name: Failure short reports
+        if: ${{ failure() }}
+        run: |
+          cat reports/tests_pipeline_level_quant_torch_cuda_stats.txt
+          cat reports/tests_pipeline_level_quant_torch_cuda_failures_short.txt
+      - name: Test suite reports artifacts
+        if: ${{ always() }}
+        uses: actions/upload-artifact@v4
+        with:
+          name: torch_cuda_pipeline_level_quant_reports
+          path: reports
+      - name: Generate Report and Notify Channel
+        if: always()
+        run: |
+          pip install slack_sdk tabulate
+          python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+
 # M1 runner currently not well supported
 # TODO: (Dhruv) add these back when we setup better testing for Apple Silicon
 # run_nightly_tests_apple_m1:
```
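For reference, the sketch below approximates the kind of check this nightly job exercises: build a pipeline-level config and load a pipeline with it, as described by the docs added in this PR. The actual `tests/quantization/test_pipeline_level_quantization.py` file is not part of this diff, so the test name, checkpoint, and memory threshold here are illustrative assumptions only.

```py
# Hypothetical smoke test; not the actual contents of
# tests/quantization/test_pipeline_level_quantization.py.
import torch

from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig


def test_pipeline_level_quant_smoke():
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=["transformer", "text_encoder_2"],
    )
    pipe = DiffusionPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",  # assumed checkpoint, mirroring the docs example
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    ).to("cuda")

    # NF4 weights should take far less memory than bf16 weights; 10 GB is an assumed bound.
    footprint_gb = pipe.transformer.get_memory_footprint() / 1e9
    assert footprint_gb < 10, f"unexpected transformer footprint: {footprint_gb:.1f} GB"
```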

docs/source/en/api/quantization.md

Lines changed: 4 additions & 3 deletions

```diff
@@ -13,16 +13,17 @@ specific language governing permissions and limitations under the License.
 
 # Quantization
 
-Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference. Diffusers supports 8-bit and 4-bit quantization with [bitsandbytes](https://huggingface.co/docs/bitsandbytes/en/index).
-
-Quantization techniques that aren't supported in Transformers can be added with the [`DiffusersQuantizer`] class.
+Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference.
 
 <Tip>
 
 Learn how to quantize models in the [Quantization](../quantization/overview) guide.
 
 </Tip>
 
+## PipelineQuantizationConfig
+
+[[autodoc]] quantizers.PipelineQuantizationConfig
 
 ## BitsAndBytesConfig
 
```
docs/source/en/quantization/overview.md

Lines changed: 87 additions & 0 deletions

@@ -39,3 +39,90 @@ Diffusers currently supports the following quantization methods.

Existing context (unchanged):

- [Quanto](./quanto.md)

[This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.

New section (all 87 changed lines are additions):

## Pipeline-level quantization

Diffusers allows users to directly initialize pipelines from checkpoints that may contain quantized models ([example](https://huggingface.co/hf-internal-testing/flux.1-dev-nf4-pkg)). However, users may want to apply quantization on-the-fly when initializing a pipeline from a pre-trained and non-quantized checkpoint. You can do this with [`~quantizers.PipelineQuantizationConfig`].

Start by defining a `PipelineQuantizationConfig`:

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers.quantization_config import QuantoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": QuantoConfig(weights_dtype="int8"),
        "text_encoder_2": BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)
```

Then pass it to [`~DiffusionPipeline.from_pretrained`] and run inference:

```py
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]
```

This method allows for more granular control over the quantization specifications of individual model-level components of a pipeline. It also allows for different quantization backends for different components. In the above example, you used a combination of Quanto and bitsandbytes. However, one caveat of this method is that users need to know which components come from `transformers` to be able to import the right quantization config class.

The other method is simpler in terms of experience but is less flexible. Start by defining a `PipelineQuantizationConfig` but in a different way:

```py
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
```

This `pipeline_quant_config` can now be passed to [`~DiffusionPipeline.from_pretrained`] similar to the above example.

In this case, `quant_kwargs` will be used to initialize the quantization specifications of the respective quantization configuration class of `quant_backend`. `components_to_quantize` is used to denote the components that will be quantized. For most pipelines, you would want to keep `transformer` in the list as it is often the most compute- and memory-intensive component.

The config below will work for most diffusion pipelines that have a `transformer` component present. In most cases, you will want to quantize the `transformer` component as that is often the most compute-intensive part of a diffusion pipeline.

```py
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer"],
)
```

Below is a list of the supported quantization backends available in both `diffusers` and `transformers`:

* `bitsandbytes_4bit`
* `bitsandbytes_8bit`
* `gguf`
* `quanto`
* `torchao`

Diffusion pipelines can have multiple text encoders. [`FluxPipeline`] has two, for example. It's recommended to quantize the text encoders that are memory-intensive. Some examples include T5, Llama, Gemma, etc. In the above example, you quantized the T5 model of [`FluxPipeline`] through `text_encoder_2` while keeping the CLIP model intact (accessible through `text_encoder`).
src/diffusers/pipelines/pipeline_loading_utils.py

Lines changed: 13 additions & 0 deletions

```diff
@@ -675,8 +675,10 @@ def load_sub_model(
     use_safetensors: bool,
     dduf_entries: Optional[Dict[str, DDUFEntry]],
     provider_options: Any,
+    quantization_config: Optional[Any] = None,
 ):
     """Helper method to load the module `name` from `library_name` and `class_name`"""
+    from ..quantizers import PipelineQuantizationConfig
 
     # retrieve class candidates
 
@@ -769,6 +771,17 @@ def load_sub_model(
     else:
         loading_kwargs["low_cpu_mem_usage"] = False
 
+    if (
+        quantization_config is not None
+        and isinstance(quantization_config, PipelineQuantizationConfig)
+        and issubclass(class_obj, torch.nn.Module)
+    ):
+        model_quant_config = quantization_config._resolve_quant_config(
+            is_diffusers=is_diffusers_model, module_name=name
+        )
+        if model_quant_config is not None:
+            loading_kwargs["quantization_config"] = model_quant_config
+
     # check if the module is in a subdirectory
     if dduf_entries:
         loading_kwargs["dduf_entries"] = dduf_entries
```
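Conceptually, the per-component lookup that `_resolve_quant_config` performs can be pictured as below. This is an illustrative approximation only, not the implementation added in this PR; the helper name `resolve_component_config` and its exact arguments are hypothetical.

```py
# Hypothetical sketch of pipeline-level config resolution per component.
from typing import Any, Dict, List, Optional


def resolve_component_config(
    quant_mapping: Optional[Dict[str, Any]],
    backend_config: Optional[Any],
    components_to_quantize: Optional[List[str]],
    module_name: str,
) -> Optional[Any]:
    # An explicit mapping such as {"transformer": ..., "text_encoder_2": ...}
    # wins: return whatever config is registered for this component, if any.
    if quant_mapping:
        return quant_mapping.get(module_name)

    # Otherwise, apply the single backend-level config only to the listed components.
    if components_to_quantize and module_name in components_to_quantize:
        return backend_config

    # Returning None leaves the component un-quantized, matching the
    # `if model_quant_config is not None` guard in load_sub_model above.
    return None
```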

src/diffusers/pipelines/pipeline_utils.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -47,6 +47,7 @@
 from ..models import AutoencoderKL
 from ..models.attention_processor import FusedAttnProcessor2_0
 from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, ModelMixin
+from ..quantizers import PipelineQuantizationConfig
 from ..quantizers.bitsandbytes.utils import _check_bnb_status
 from ..schedulers.scheduling_utils import SCHEDULER_CONFIG_NAME
 from ..utils import (
@@ -725,6 +726,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
         use_safetensors = kwargs.pop("use_safetensors", None)
         use_onnx = kwargs.pop("use_onnx", None)
         load_connected_pipeline = kwargs.pop("load_connected_pipeline", False)
+        quantization_config = kwargs.pop("quantization_config", None)
 
         if torch_dtype is not None and not isinstance(torch_dtype, dict) and not isinstance(torch_dtype, torch.dtype):
             torch_dtype = torch.float32
@@ -741,6 +743,9 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
                 " install accelerate\n```\n."
             )
 
+        if quantization_config is not None and not isinstance(quantization_config, PipelineQuantizationConfig):
+            raise ValueError("`quantization_config` must be an instance of `PipelineQuantizationConfig`.")
+
         if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
             raise NotImplementedError(
                 "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
@@ -1001,6 +1006,7 @@ def load_module(name, value):
                 use_safetensors=use_safetensors,
                 dduf_entries=dduf_entries,
                 provider_options=provider_options,
+                quantization_config=quantization_config,
             )
             logger.info(
                 f"Loaded {name} as {class_name} from `{name}` subfolder of {pretrained_model_name_or_path}."
```
