Commit 61d223c

add: CUDA graph details. (#6408)
1 parent bf725e0 commit 61d223c

1 file changed: +4 -2 lines changed

docs/source/en/tutorials/fast_diffusion.md

Lines changed: 4 additions & 2 deletions
@@ -166,7 +166,9 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
 image = pipe(prompt, num_inference_steps=30).images[0]
 ```
 
-`torch.compile` offers different backends and modes. As we’re aiming for maximum inference speed, we opt for the inductor backend using the “max-autotune”. “max-autotune” uses CUDA graphs and optimizes the compilation graph specifically for latency. Specifying fullgraph to be True ensures that there are no graph breaks in the underlying model, ensuring the fullest potential of `torch.compile`.
+`torch.compile` offers different backends and modes. As we’re aiming for maximum inference speed, we opt for the inductor backend using the “max-autotune”. “max-autotune” uses CUDA graphs and optimizes the compilation graph specifically for latency. Using CUDA graphs greatly reduces the overhead of launching GPU operations. It saves time by using a mechanism to launch multiple GPU operations through a single CPU operation.
+
+Specifying fullgraph to be True ensures that there are no graph breaks in the underlying model, ensuring the fullest potential of `torch.compile`.
 
 Using SDPA attention and compiling both the UNet and VAE reduces the latency from 3.31 seconds to 2.54 seconds.

@@ -212,7 +214,7 @@ Through experimentation, we found that certain linear layers in the UNet and the
 
 </Tip>
 
-You will leverage the ultra-lightweight pure PyTorch library [torchao](https://github.com/pytorch-labs/ao) to use its user-friendly APIs for quantization.
+You will leverage the ultra-lightweight pure PyTorch library [torchao](https://github.com/pytorch-labs/ao) (commit SHA: 54bcd5a10d0abbe7b0c045052029257099f83fd9) to use its user-friendly APIs for quantization.
 
 First, configure all the compiler tags:
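
The `torch.compile` settings that the first hunk describes (inductor backend, “max-autotune” mode, `fullgraph=True`) can be sketched in Python as follows. This is a minimal sketch, not code from the commit: `compile_for_latency` is a hypothetical helper name, `pipe` is assumed to be the pipeline object from the tutorial's earlier snippet, and the choice of `pipe.unet` and `pipe.vae.decode` as targets is an assumption based on the sentence about compiling both the UNet and VAE.

```python
# Sketch only: keyword arguments matching the settings the paragraph describes.
COMPILE_KWARGS = {
    "backend": "inductor",   # torch.compile's default backend, spelled out for clarity
    "mode": "max-autotune",  # latency-oriented mode; enables CUDA graphs on GPU
    "fullgraph": True,       # raise an error on a graph break instead of splitting the graph
}


def compile_for_latency(module_or_fn):
    """Wrap a module or callable with torch.compile using the settings above.

    Hypothetical helper for illustration; it is not part of diffusers or PyTorch.
    torch is imported lazily so the settings can be inspected without it installed.
    """
    import torch

    return torch.compile(module_or_fn, **COMPILE_KWARGS)


# Assumed usage with the tutorial's `pipe` object (not defined in this sketch):
# pipe.unet = compile_for_latency(pipe.unet)
# pipe.vae.decode = compile_for_latency(pipe.vae.decode)
# image = pipe(prompt, num_inference_steps=30).images[0]
```

`torch.compile` defers the actual compilation to the first call of the wrapped object, so the first pipeline invocation after wrapping is expected to be much slower than later ones; latency comparisons should be taken after that warm-up call.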
