Commit 5fc0bd6

Commit message: tweaks
Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
1 parent 6f484fe commit 5fc0bd6

File tree: 1 file changed, +25 -24 lines changed


_posts/2024-11-27-hadacore.md

Lines changed: 25 additions & 24 deletions
@@ -4,7 +4,7 @@ title: "HadaCore: Tensor Core Accelerated Hadamard Transform Kernel"
author: "IBM and Meta"
---

-**IBM**: Krish Agarwal, Rishi Astra, Adnan Hoque, Mudhakar Srivatsa, Raghu Ganti
+**IBM**: Krish Agarwal, Rishi Astra, Adnan Hoque, Mudhakar Srivatsa, Raghu Ganti
**Meta**: Less Wright, Sijia Chen

Quantization is a method for improving model inference speeds by compressing model weights and performing (faster) computation in lower precision data types. However, quantization can result in accuracy loss due to the presence of outliers. Recent works like [QuaRot](https://arxiv.org/abs/2404.00456), [SpinQuant](https://arxiv.org/abs/2405.16406), and [FlashAttention-3](https://arxiv.org/pdf/2407.08608) introduce methods to increase the numerical accuracy of INT4, INT8 and FP8 quantization in LLMs. These methods rely on [Hadamard Transforms](https://en.wikipedia.org/wiki/Hadamard_transform). In this blog, we present HadaCore, a Hadamard Transform CUDA kernel that achieves state-of-the-art performance on NVIDIA A100 and H100 GPUs. Our kernel achieves speedups of **1.1–1.4x** and **1.0–1.3x**, with a peak gain of **3.5x** and **3.6x** respectively, over Dao AI Lab’s [Fast Hadamard Transform Kernel](https://github.com/Dao-AILab/fast-hadamard-transform). We leverage a hardware-aware work decomposition that benefits from Tensor Core acceleration while maintaining quantization error reduction.
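The Hadamard rotation that these quantization schemes rely on is easy to prototype outside the kernel. Below is a minimal PyTorch sketch of a fast Walsh–Hadamard transform over the last (`head_dim`) dimension; it is illustrative only, not the HadaCore CUDA kernel, and the `fwht` name and the orthonormal `1/sqrt(n)` scaling are assumptions made for this example.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform along the last dimension.

    Reference sketch only (not HadaCore): assumes the last dimension
    (e.g. head_dim = 128) is a power of two, and applies a 1/sqrt(n)
    scaling so the transform acts as an orthonormal rotation.
    """
    n = x.shape[-1]
    assert n & (n - 1) == 0, "last dimension must be a power of two"
    y = x.clone()
    h = 1
    while h < n:
        # Butterfly step: combine adjacent blocks of width h.
        y = y.reshape(-1, n // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        y = torch.stack((a + b, a - b), dim=2)
        h *= 2
    return y.reshape(x.shape) / n ** 0.5

# Example: rotate query activations shaped (batch, seq_len, n_heads, head_dim).
q = torch.randn(1, 4096, 64, 128)
q_rot = fwht(q)
```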
@@ -52,19 +52,19 @@ We benchmark HadaCore against the[ Dao AI Lab Hadamard Kernel](https://github.co
*Figure 5: HadaCore Kernel Speedup on NVIDIA A100 over Dao AI Lab Fast Hadamard Kernel*


-![Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline](/assets/images/hadacore/fg5.png){:style="width:100%"}
+![Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline](/assets/images/hadacore/fg5.png){:style="width:100%; margin-top: 35px;"}


*Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline*


-![Figure 4: HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel](/assets/images/hadacore/fg6.png){:style="width:100%"}
+![Figure 4: HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel](/assets/images/hadacore/fg6.png){:style="width:100%; margin-top: 35px;"}


*Figure 4: HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel*


-![Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline](/assets/images/hadacore/fg7.png){:style="width:100%"}
+![Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline](/assets/images/hadacore/fg7.png){:style="width:100%; margin-top: 35px;"}


*Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline*
@@ -74,17 +74,16 @@ We showcase our speedup as the input tensor size (labeled element count) in our

The queries (Q), keys (K) and values (V) tensors are 4D tensors of size:

-\
-`(batch_size, seq_len, n_heads, head_dim)` \
-\
+`(batch_size, seq_len, n_heads, head_dim)`
+
A Hadamard matrix of size `head_dim` is applied to these activation tensors, so we refer to this as using a Hadamard size of `head_dim` with an element count of:

-`batch_size*seq_len*n_heads*head_dim`.
+`batch_size*seq_len*n_heads*head_dim.`

Common element counts for query rotations in an attention block:


-<table>
+<table class="table table-bordered">
<tr>
<td><strong>Model \ Tokens</strong>
</td>
@@ -97,31 +96,32 @@ Common element counts for query rotations in an attention block:
<td><strong>Llama-2 70b</strong>
</td>
<td>33,554,432 elements
-<p>
+<br>
128 Hadamard size
-<p>
+<br>
+
(1 batch * 64 heads * 4096 tokens * 128 dimensional embeddings per head per token)
</td>
<td>8192 elements
-<p>
+<br>
128 Hadamard size
-<p>
+<br>
(1 batch * 64 heads * 1 token * 128 dimensional embeddings per head per token)
</td>
</tr>
<tr>
<td><strong>Llama-3 8b</strong>
</td>
<td>33,554,432 elements
-<p>
+<br>
128 Hadamard size
-<p>
+<br>
(1 batch * 32 heads * 8192 tokens * 128 dimensional embeddings per head per token)
</td>
<td>4,096 elements
-<p>
+<br>
128 Hadamard size
-<p>
+<br>
(1 batch * 32 heads * 1 token * 128 dimensional embeddings per head per token)
</td>
</tr>
@@ -132,24 +132,25 @@ HadaCore achieves **1.1–1.4x** speedup on A100 and **1.0–1.3x** speedup on H

**MMLU Benchmarks**

-We evaluated MMLU scores on a [Llama 3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) inference workload where the FlashAttention computation was performed in FP8. Newer generation [NVIDIA Hopper GPUs ](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/)come equipped with FP8 Tensor Cores that deliver substantial compute gain over FP16. \
+We evaluated MMLU scores on a [Llama 3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) inference workload where the FlashAttention computation was performed in FP8. Newer generation [NVIDIA Hopper GPUs ](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/)come equipped with FP8 Tensor Cores that deliver substantial compute gain over FP16.
+
Our results show the benefit of using HadaCore for accuracy preservation when combined with optimizations such as FP8 FlashAttention.


-<table>
+<table class="table table-bordered">
<tr>
<td><strong>Format</strong>
</td>
<td><strong>Method</strong>
</td>
<td><strong>Llama3.1-8B</strong>
-<p style="text-align: center">
+<br>
<strong>Avg. 5-Shot MMLU Accuracy</strong>
</td>
</tr>
<tr>
<td><strong>Q, K, V: FP16</strong>
-<p>
+<br>
<strong>FlashAttention: FP16</strong>
</td>
<td>N/A
@@ -159,7 +160,7 @@ Our results show the benefit of using HadaCore for accuracy preservation when co
</tr>
<tr>
<td><strong>Q, K, V: FP16</strong>
-<p>
+<br>
<strong>FlashAttention: FP8</strong>
</td>
<td>No Hadamard
@@ -169,7 +170,7 @@ Our results show the benefit of using HadaCore for accuracy preservation when co
</tr>
<tr>
<td><strong>Q, K, V: FP8</strong>
-<p>
+<br>
<strong>FlashAttention: FP8</strong>
</td>
<td>HadaCore
@@ -179,7 +180,7 @@ Our results show the benefit of using HadaCore for accuracy preservation when co
</tr>
<tr>
<td><strong>Q, K, V: FP8</strong>
-<p>
+<br>
<strong>FlashAttention: FP8</strong>
</td>
<td>Dao AI Fast Hadamard Kernel
