The [HadaCore Kernel is publicly available](https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/inference/hadamard_transform).
## Background
[QuaRot](https://arxiv.org/abs/2404.00456) and [SpinQuant](https://arxiv.org/abs/2405.16406) both propose methods to increase the numerical accuracy of INT4 and INT8 quantization in LLMs. Both methods rotate model activations, since a rotation is statistically likely to reduce the magnitude of outliers: it “distributes” extreme values among other (less extreme) dimensions, and it is easily inverted by applying the inverse (transpose) of the rotation matrix. These methods can also improve FP8 inference accuracy, such as in [FlashAttention-3](https://arxiv.org/pdf/2407.08608).
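To make the intuition concrete, here is a minimal NumPy sketch (an illustration, not code from QuaRot or SpinQuant) that injects a single outlier into a random activation vector and compares the INT4 round-trip error with and without an orthonormal Hadamard rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
x = rng.normal(size=n).astype(np.float32)
x[7] = 50.0  # inject one large outlier, as often seen in LLM activations

# Orthonormal Walsh-Hadamard matrix H (H @ H.T == I), built by Kronecker products.
H = np.array([[1.0, 1.0], [1.0, -1.0]])
while H.shape[0] < n:
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
H /= np.sqrt(n)

def int4_roundtrip(v):
    """Symmetric per-tensor INT4 quantize followed by dequantize."""
    scale = np.abs(v).max() / 7.0
    q = np.clip(np.round(v / scale), -8, 7)
    return q * scale

# Quantize directly vs. rotate -> quantize -> un-rotate.
err_plain = np.abs(int4_roundtrip(x) - x).mean()
err_rotated = np.abs(int4_roundtrip(x @ H) @ H.T - x).mean()
print(f"mean abs error, plain:   {err_plain:.4f}")
print(f"mean abs error, rotated: {err_rotated:.4f}")  # typically much lower here
```

Because the rotation is orthonormal it can be undone exactly after dequantization; the only change is that the quantizer now sees a better-conditioned distribution.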
Applying these rotation matrices introduces model runtime overhead due to the online operations shown in Figure 2. These rotations can be applied through matrix multiplication, but the added overhead would diminish the benefits from quantization. Therefore, QuaRot and SpinQuant opt to use Walsh-Hadamard matrices, a special type of rotation matrix that can be applied faster than matrix multiplication using the [Fast Walsh-Hadamard Transform](https://en.wikipedia.org/wiki/Fast_Walsh%E2%80%93Hadamard_transform) algorithm. HadaCore is an optimized implementation of this algorithm for NVIDIA GPUs that support Tensor Cores.
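For reference, the Fast Walsh-Hadamard Transform replaces an O(n²) matrix multiplication with O(n log n) additions and subtractions arranged in butterfly stages. A simple, unoptimized Python version of the classic algorithm, checked against the explicit matrix product, looks like this:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard Transform (unnormalized); len(x) must be a power of two."""
    x = np.array(x, dtype=np.float64)  # work on a copy
    n = x.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly: sum and difference
        h *= 2
    return x

# Check against the explicit (unnormalized) Hadamard matrix product.
n = 16
H = np.array([[1.0, 1.0], [1.0, -1.0]])
while H.shape[0] < n:
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
x = np.random.default_rng(0).normal(size=n)
assert np.allclose(fwht(x), H @ x)
```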
## Tensor Core Accelerated Hadamard Transform
HadaCore leverages [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/), which are specialized compute units on NVIDIA GPUs optimized for matrix multiplication. To use them, our kernel performs a hardware-aware work decomposition of the Fast Walsh-Hadamard algorithm that maps onto the [MMA PTX instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=mma#multiply-and-accumulate-instruction-mma) executed by the Tensor Cores. HadaCore applies a 16×16 Hadamard transform to chunks of the input data, so the computation can be offloaded to the FP16 Tensor Cores using the [mma.m16n8k16](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=mma#matrix-fragments-for-mma-m16n8k16-with-floating-point-type) instruction. The warp-level parallelism for HadaCore is shown below. We process fragments of 256 elements in parallel using warp-level Tensor Core operations, which yields Hadamard transforms of sizes up to 256; for larger sizes, we shuffle data between warps and repeat.
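The exact warp-level tiling lives in the CUDA kernel, but the linear-algebra idea behind building larger transforms out of 16×16 matrix multiplies can be sketched in a few lines of NumPy: a size-256 Hadamard matrix is the Kronecker product of two size-16 Hadamard matrices, so applying it to a 256-element fragment amounts to two 16×16 matrix multiplications on a 16×16 tile of the data, which is exactly the shape the mma.m16n8k16 instruction consumes.

```python
import numpy as np

def hadamard(n):
    """Unnormalized Sylvester Hadamard matrix of size n (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
    return H

H16, H256 = hadamard(16), hadamard(256)
x = np.random.default_rng(0).normal(size=256)

# H256 == kron(H16, H16), so applying it to x is the same as reshaping x into a
# 16x16 tile and multiplying by H16 on both sides: two Tensor-Core-sized matmuls.
tile = x.reshape(16, 16)
via_tiles = (H16 @ tile @ H16.T).reshape(-1)
assert np.allclose(via_tiles, H256 @ x)
```

In HadaCore the same composition is carried out with warp-level mma instructions and register shuffles rather than NumPy matmuls, but the decomposition is the reason 16×16 matrix multiplies suffice.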
## Microbenchmarks
We benchmark HadaCore against the [Dao AI Lab Hadamard Kernel](https://github.com/Dao-AILab) on both NVIDIA H100 and A100 GPUs across varying Hadamard and input tensor sizes.
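For readers who want to reproduce this kind of measurement, the harness below uses `torch.utils.benchmark`. The explicit-matmul transform is only a stand-in; the compiled HadaCore and Dao AI Lab kernels (whose Python entry points are not shown in this post) would be swapped in at the marked spot.

```python
import torch
from torch.utils import benchmark

def bench(fn, x, label):
    # Timer handles CUDA synchronization when timing GPU work.
    t = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x}, label=label)
    return t.blocked_autorange(min_run_time=1.0)

# Stand-in transform: an explicit 256x256 Hadamard matmul in FP16.
# Swap in the real HadaCore / Dao AI Lab kernel bindings here.
H = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
while H.shape[0] < 256:
    H = torch.kron(H, torch.tensor([[1.0, 1.0], [1.0, -1.0]]))
H = (H / 16.0).to("cuda", torch.float16)  # 1/sqrt(256) normalization

def matmul_hadamard(x):
    return x @ H

for element_count in (2**20, 2**24, 2**26):
    x = torch.randn(element_count // 256, 256, device="cuda", dtype=torch.float16)
    print(bench(matmul_hadamard, x, label="matmul baseline"))
```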
*Color-coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline*
We showcase our speedup as the input tensor size (labeled “element count” in our charts) increases. Element count is the number of elements in the target matrix we are rotating. For example, in multi-head attention:
The queries (Q), keys (K), and values (V) are 4D tensors whose dimensions are the batch size, the number of attention heads, the sequence length, and the head dimension.
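The element count for a query rotation is simply the product of those four dimensions. A quick back-of-the-envelope calculation, using illustrative shapes rather than the exact settings benchmarked above:

```python
# Element count for rotating the query tensor of one attention block:
# batch size x number of heads x sequence length x head dimension.
# The configurations below are illustrative assumptions, not the exact
# settings used in the benchmarks above.
configs = [
    # (batch, heads, seq_len, head_dim)
    (1, 32, 2048, 128),
    (4, 32, 4096, 128),
    (16, 32, 8192, 128),
]
for batch, heads, seq_len, head_dim in configs:
    count = batch * heads * seq_len * head_dim
    print(f"batch={batch:<3} heads={heads} seq_len={seq_len:<5} "
          f"head_dim={head_dim} -> element count = {count:,}")
```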
HadaCore achieves a **1.1–1.4x** speedup on A100 and a **1.0–1.3x** speedup on H100 over Dao AI Lab’s Fast Hadamard kernel, with peak gains of **3.5x** and **3.6x**, respectively. For smaller sizes on H100, HadaCore’s gain decreases. For future work, we plan to incorporate Hopper-specific features such as TMA and WGMMA to improve H100 performance.
## MMLU Benchmarks
We evaluated MMLU scores on a [Llama 3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) inference workload where the FlashAttention computation was performed in FP8. Newer-generation [NVIDIA Hopper GPUs](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/) come equipped with FP8 Tensor Cores that deliver substantial compute gains over FP16.
From the above MMLU scores, we note that for Llama 3.1-8B inference with FP8 attention, HadaCore reduces the quantization error introduced by computing attention in a lower precision.
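The reason a rotation can be inserted here at all is that rotating Q and K along the head dimension by the same orthonormal matrix leaves the attention scores QKᵀ unchanged in exact arithmetic, so its only effect is to tame outliers before the FP8 quantization. A small PyTorch check of that identity (our illustration, not code from FlashAttention-3 or HadaCore):

```python
import torch

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 2, 4, 128, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)

# Orthonormal Hadamard matrix over the head dimension (64 = 2**6).
base = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
H = base
while H.shape[0] < head_dim:
    H = torch.kron(H, base)
H = H / head_dim**0.5

scores = q @ k.transpose(-1, -2)                  # plain attention scores
scores_rot = (q @ H) @ (k @ H).transpose(-1, -2)  # rotate Q and K first

# H @ H.T == I, so the scores match up to floating-point round-off; the
# rotation only changes the statistics of what gets quantized to FP8.
print(torch.allclose(scores, scores_rot, atol=1e-2))
```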
## Conclusion
We showcased the speedups achieved by moving the Fast Walsh-Hadamard algorithm into a CUDA kernel that leverages Tensor Core acceleration, achieving a peak speedup of **3.5x** and **3.6x** over the Dao AI Lab Fast Hadamard kernel on NVIDIA A100 and H100, respectively.
Further, we showed on the MMLU benchmark that rotating with HadaCore maintains quantization error reduction similar to the Fast Hadamard kernel, while providing computational acceleration.
## Future Work
We plan to implement a Triton version of our kernel and experiment with more advanced techniques such as kernel fusion to support fused Hadamard transform and quantization. Further, we plan to extend our kernel to support BF16 Tensor Core compute.