
CogView4 Control Block #10809


Merged: 49 commits, merged on Mar 15, 2025
Commits
a97fca2
1
zRzRzRzRzRzRzR Feb 17, 2025
c30ca7a
change to channel 1
zRzRzRzRzRzRzR Feb 17, 2025
5c25cd2
cogview4 control training
zRzRzRzRzRzRzR Feb 18, 2025
44bfd4c
add CacheMixin
zRzRzRzRzRzRzR Feb 18, 2025
a9f448e
1
zRzRzRzRzRzRzR Feb 18, 2025
2cbdf35
remove initial_input_channels change for val
zRzRzRzRzRzRzR Feb 18, 2025
df83bf2
1
zRzRzRzRzRzRzR Feb 18, 2025
8bba67a
update
zRzRzRzRzRzRzR Feb 18, 2025
b9d864b
use 3.5
zRzRzRzRzRzRzR Feb 18, 2025
5d2e994
new loss
zRzRzRzRzRzRzR Feb 18, 2025
ebeb1e4
Merge branch 'huggingface:main' into cogview4_control
zRzRzRzRzRzRzR Feb 19, 2025
95e8504
1
zRzRzRzRzRzRzR Feb 19, 2025
940c23b
Merge branch 'cogview4_control' of https://github.com/zRzRzRzRzRzRzR/…
zRzRzRzRzRzRzR Feb 19, 2025
7a68a3e
use imagetoken
zRzRzRzRzRzRzR Feb 19, 2025
2a81772
for megatron convert
zRzRzRzRzRzRzR Feb 19, 2025
1d91a24
1
zRzRzRzRzRzRzR Feb 19, 2025
dff4b29
train con and uc
zRzRzRzRzRzRzR Feb 19, 2025
050b97c
Merge branch 'huggingface:main' into cogview4_control
zRzRzRzRzRzRzR Feb 19, 2025
b007be0
2
zRzRzRzRzRzRzR Feb 19, 2025
25f4e4b
remove guidance_scale
zRzRzRzRzRzRzR Feb 20, 2025
7ffecbc
Update pipeline_cogview4_control.py
zRzRzRzRzRzRzR Feb 21, 2025
b4e11e7
fix
zRzRzRzRzRzRzR Feb 21, 2025
efa0f41
Merge branch 'huggingface:main' into cogview4_control
zRzRzRzRzRzRzR Feb 21, 2025
f55e3cc
use cogview4 pipeline with timestep
zRzRzRzRzRzRzR Feb 21, 2025
9410e46
Merge branch 'cogview4_control' of https://github.com/zRzRzRzRzRzRzR/…
zRzRzRzRzRzRzR Feb 21, 2025
29b0c81
update shift_factor
zRzRzRzRzRzRzR Feb 24, 2025
52d4ebf
Merge branch 'huggingface:main' into cogview4_control
zRzRzRzRzRzRzR Feb 25, 2025
65b3719
Merge branch 'huggingface:main' into cogview4_control
zRzRzRzRzRzRzR Feb 26, 2025
90830ed
remove the uncond
zRzRzRzRzRzRzR Feb 26, 2025
71f9235
add max length
zRzRzRzRzRzRzR Feb 26, 2025
19d7d27
change convert and use GLMModel instead of GLMForCasualLM
zRzRzRzRzRzRzR Feb 27, 2025
fe6287a
Merge branch 'huggingface:main' into cogview4_control
zRzRzRzRzRzRzR Feb 27, 2025
2f74c4e
fix
zRzRzRzRzRzRzR Feb 27, 2025
264060e
[cogview4] Add attention mask support to transformer model
OleehyO Feb 28, 2025
9a10ceb
[fix] Add attention mask for padded token
OleehyO Mar 4, 2025
b6e10e7
Merge branch 'huggingface:main' into cogview4_control
zRzRzRzRzRzRzR Mar 5, 2025
692e5cc
update
zRzRzRzRzRzRzR Mar 5, 2025
fc3830c
remove padding type
zRzRzRzRzRzRzR Mar 6, 2025
98a2417
Update train_control_cogview4.py
zRzRzRzRzRzRzR Mar 6, 2025
c774f45
resolve conflicts with #10981
zRzRzRzRzRzRzR Mar 12, 2025
687faa4
Merge branch 'main' into cogview4_control
zRzRzRzRzRzRzR Mar 12, 2025
8abca19
add control convert
zRzRzRzRzRzRzR Mar 12, 2025
cbfeb0b
Merge branch 'cogview4_control' of https://github.com/zRzRzRzRzRzRzR/…
zRzRzRzRzRzRzR Mar 12, 2025
347dd17
use control format
zRzRzRzRzRzRzR Mar 13, 2025
775bb8c
fix
zRzRzRzRzRzRzR Mar 13, 2025
985baa9
add missing import
zRzRzRzRzRzRzR Mar 13, 2025
c2a1985
Merge branch 'huggingface:main' into cogview4_control
zRzRzRzRzRzRzR Mar 15, 2025
88abb39
update with cogview4 formate
zRzRzRzRzRzRzR Mar 15, 2025
3e3387e
make style
yiyixuxu Mar 15, 2025
201 changes: 201 additions & 0 deletions examples/cogview4-control/README.md
# Training CogView4 Control

This (experimental) example shows how to train Control LoRAs with [CogView4](https://huggingface.co/THUDM/CogView4-6B) by conditioning it on additional structural controls (like depth maps, poses, etc.). We also provide a script for full fine-tuning; refer to [this section](#full-fine-tuning).

To incorporate additional condition latents, we expand the input features of CogView4 from 64 to 128 channels. The first 64 channels correspond to the original input latents to be denoised, while the latter 64 channels correspond to the control latents. This expansion happens in the `patch_embed` layer, where the combined latents are projected to the expected feature dimension of the rest of the network. Inference is performed using the `CogView4ControlPipeline`.
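
The sketch below illustrates the idea behind this channel expansion: widen the patch-embedding projection to accept twice as many input features and zero-initialize the new (control) half so that the expanded model initially behaves exactly like the base model. The attribute path `patch_embed.proj` and the assumption that it is a `torch.nn.Linear` are illustrative; the training script is the source of truth.

```py
import torch
from diffusers import CogView4Transformer2DModel

transformer = CogView4Transformer2DModel.from_pretrained(
    "THUDM/CogView4-6B", subfolder="transformer", torch_dtype=torch.bfloat16
)

with torch.no_grad():
    # Assumption: the patchified latents pass through a Linear projection at
    # `transformer.patch_embed.proj`; adapt the attribute path to the actual model.
    old_proj = transformer.patch_embed.proj
    new_proj = torch.nn.Linear(
        old_proj.in_features * 2,  # 64 -> 128 latent channels after patchification
        old_proj.out_features,
        bias=old_proj.bias is not None,
        dtype=old_proj.weight.dtype,
        device=old_proj.weight.device,
    )
    new_proj.weight.zero_()
    # Copy the original weights for the first half (the latents to be denoised) and
    # keep the control half at zero, so training starts from the base model's behavior.
    new_proj.weight[:, : old_proj.in_features].copy_(old_proj.weight)
    if old_proj.bias is not None:
        new_proj.bias.copy_(old_proj.bias)
    transformer.patch_embed.proj = new_proj
```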

> [!NOTE]
> **Gated model**
>
> As the model is gated, before using it with diffusers you first need to go to the [CogView4 Hugging Face page](https://huggingface.co/THUDM/CogView4-6B), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate. Use the command below to log in:

```bash
huggingface-cli login
```
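
Alternatively, you can log in from Python using `huggingface_hub`:

```py
from huggingface_hub import login

login()  # prompts for your Hugging Face access token
```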

The example command below shows how to launch fine-tuning for pose conditions. The dataset ([`raulc0399/open_pose_controlnet`](https://huggingface.co/datasets/raulc0399/open_pose_controlnet)) being used here already has the pose conditions of the original images, so we don't have to compute them.
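
If you want a quick look at the data first, here is a minimal sketch with the `datasets` library (the exact column names are whatever the dataset defines):

```py
from datasets import load_dataset

# The dataset already pairs each target image with its pose condition and a caption.
dataset = load_dataset("raulc0399/open_pose_controlnet")
print(dataset)  # shows the available splits and their columns
```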

```bash
accelerate launch train_control_lora_cogview4.py \
--pretrained_model_name_or_path="THUDM/CogView4-6B" \
--dataset_name="raulc0399/open_pose_controlnet" \
--output_dir="pose-control-lora" \
--mixed_precision="bf16" \
--train_batch_size=1 \
--rank=64 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--use_8bit_adam \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=5000 \
--validation_image="openpose.png" \
--validation_prompt="A couple, 4k photo, highly detailed" \
--offload \
--seed="0" \
--push_to_hub
```

`openpose.png` comes from [here](https://huggingface.co/Adapter/t2iadapter/resolve/main/openpose.png).

You need to install `diffusers` from the branch of [this PR](https://github.com/huggingface/diffusers/pull/9999). Once it is merged, you can install `diffusers` from `main`.
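
Installing from `main` then looks like:

```bash
pip install git+https://github.com/huggingface/diffusers
```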

The training script exposes additional CLI args that might be useful to experiment with (a combined example follows the list):

* `use_lora_bias`: When set, additionally trains the biases of the `lora_B` layer.
* `train_norm_layers`: When set, additionally trains the normalization scales. Takes care of saving and loading.
* `lora_layers`: Specify the layers you want to apply LoRA to. If you specify "all-linear", all the linear layers will be LoRA-attached.
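
For instance, to attach LoRA to all linear layers and additionally train the LoRA biases and the norm scales, the flags could be combined like this (a hypothetical variation of the command above):

```bash
accelerate launch train_control_lora_cogview4.py \
  --pretrained_model_name_or_path="THUDM/CogView4-6B" \
  --dataset_name="raulc0399/open_pose_controlnet" \
  --output_dir="pose-control-lora" \
  --mixed_precision="bf16" \
  --rank=64 \
  --lora_layers="all-linear" \
  --use_lora_bias \
  --train_norm_layers \
  --learning_rate=1e-4 \
  --max_train_steps=5000 \
  --validation_image="openpose.png" \
  --validation_prompt="A couple, 4k photo, highly detailed" \
  --seed="0"
```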

### Training with DeepSpeed

It's possible to train with [DeepSpeed](https://github.com/microsoft/DeepSpeed), specifically leveraging its ZeRO stage-2 optimization. To use it, save the following config to a YAML file (feel free to modify as needed):

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

And then while launching training, pass the config file:

```bash
accelerate launch --config_file=CONFIG_FILE.yaml ...
```

### Inference

The pose images in our dataset were computed using the [`controlnet_aux`](https://github.com/huggingface/controlnet_aux) library. Let's install it first:

```bash
pip install controlnet_aux
```

And then we are ready:

```py
from controlnet_aux import OpenposeDetector
from diffusers import CogView4ControlPipeline
from diffusers.utils import load_image
from PIL import Image
import numpy as np
import torch

pipe = CogView4ControlPipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("...") # change this.

open_pose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

# prepare pose condition.
url = "https://huggingface.co/Adapter/t2iadapter/resolve/main/people.jpg"
image = load_image(url)
image = open_pose(image, detect_resolution=512, image_resolution=1024)
image = np.array(image)[:, :, ::-1]
image = Image.fromarray(np.uint8(image))

prompt = "A couple, 4k photo, highly detailed"

gen_images = pipe(
    prompt=prompt,
    control_image=image,
    num_inference_steps=50,
    joint_attention_kwargs={"scale": 0.9},
    guidance_scale=25.,
).images[0]
gen_images.save("output.png")
```
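
Optionally, and assuming the pipeline exposes the standard `diffusers` LoRA helpers, the LoRA can be fused into the base weights so inference no longer pays the adapter overhead:

```py
# A sketch, assuming the standard diffusers LoRA helpers are available on this pipeline.
pipe.fuse_lora(lora_scale=0.9)  # bake the LoRA into the transformer weights
pipe.unload_lora_weights()      # drop the now-redundant adapter parameters
```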

## Full fine-tuning

We provide a non-LoRA version of the training script `train_control_cogview4.py`. Here is an example command:

```bash
accelerate launch --config_file=accelerate_ds2.yaml train_control_cogview4.py \
--pretrained_model_name_or_path="THUDM/CogView4-6B" \
--dataset_name="raulc0399/open_pose_controlnet" \
--output_dir="pose-control" \
--mixed_precision="bf16" \
--train_batch_size=2 \
--dataloader_num_workers=4 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--use_8bit_adam \
--proportion_empty_prompts=0.2 \
--learning_rate=5e-5 \
--adam_weight_decay=1e-4 \
--report_to="wandb" \
--lr_scheduler="cosine" \
--lr_warmup_steps=1000 \
--checkpointing_steps=1000 \
--max_train_steps=10000 \
--validation_steps=200 \
--validation_image "2_pose_1024.jpg" "3_pose_1024.jpg" \
--validation_prompt "two friends sitting by each other enjoying a day at the park, full hd, cinematic" "person enjoying a day at the park, full hd, cinematic" \
--offload \
--seed="0" \
--push_to_hub
```

Change the `validation_image` and `validation_prompt` as needed.

For inference, this time, we will run:

```py
from controlnet_aux import OpenposeDetector
from diffusers import CogView4ControlPipeline, CogView4Transformer2DModel
from diffusers.utils import load_image
from PIL import Image
import numpy as np
import torch

transformer = CogView4Transformer2DModel.from_pretrained("...") # change this.
pipe = CogView4ControlPipeline.from_pretrained(
    "THUDM/CogView4-6B", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")

open_pose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

# prepare pose condition.
url = "https://huggingface.co/Adapter/t2iadapter/resolve/main/people.jpg"
image = load_image(url)
image = open_pose(image, detect_resolution=512, image_resolution=1024)
image = np.array(image)[:, :, ::-1]
image = Image.fromarray(np.uint8(image))

prompt = "A couple, 4k photo, highly detailed"

gen_images = pipe(
    prompt=prompt,
    control_image=image,
    num_inference_steps=50,
    guidance_scale=25.,
).images[0]
gen_images.save("output.png")
```

## Things to note

* The scripts provided in this directory are experimental and educational. This means we may have to tweak things around to get good results on a given condition. We believe this is best done with the community 🤗
* The scripts are not memory-optimized, but if `--offload` is specified, the VAE and the text encoders are offloaded to CPU when they are not in use.
* We can extract LoRAs from the fully fine-tuned model. While we currently don't provide any utilities for that, users are welcome to refer to [this script](https://github.com/Stability-AI/stability-ComfyUI-nodes/blob/master/control_lora_create.py) that provides similar functionality; a rough sketch of the idea follows.
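
The core idea is to take the weight difference between each fine-tuned and base layer and approximate it with a low-rank factorization via truncated SVD. The sketch below is illustrative only (function name and rank are arbitrary) and is not the referenced ComfyUI script:

```py
import torch

def extract_lora(base_weight: torch.Tensor, tuned_weight: torch.Tensor, rank: int = 64):
    """Return (lora_A, lora_B) such that lora_B @ lora_A approximates tuned_weight - base_weight."""
    delta = (tuned_weight - base_weight).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    lora_B = u[:, :rank] * s[:rank]  # (out_features, rank), singular values folded in
    lora_A = vh[:rank, :]            # (rank, in_features)
    return lora_A, lora_B
```
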
6 changes: 6 additions & 0 deletions examples/cogview4-control/requirements.txt
transformers==4.47.0
wandb
torch
torchvision
accelerate==1.2.0
peft>=0.14.0